

StreamFP: Fingerprint-guided Data Selection for Efficient Stream Learning

Python 3.8+ · PyTorch 1.12+ · License: MIT · WWW '26

πŸ“’ News

  • [April 2026] StreamFP has been accepted to The Web Conference 2026 (WWW '26)!

πŸ“– Overview

StreamFP is a novel stream learning framework designed to handle non-stationary data streams with high efficiency and robustness against catastrophic forgetting. It introduces learnable fingerprintsβ€”compact parameter vectors that summarize the model stateβ€”to guide data selection processes.

Key challenges in Stream Learning (SL) addressed by StreamFP:

  1. Data Redundancy: Incoming streams often contain redundant data that wastes computation.
  2. Catastrophic Forgetting: Incremental updates can overwrite earlier knowledge.
  3. Efficiency: Traditional model-based selection is often too computationally expensive for real-time streams.

StreamFP achieves superior accuracy and efficiency compared to state-of-the-art methods (e.g., Camel, ER, GradMatch) across varying data arrival rates.

πŸš€ Methodology

StreamFP consists of three key components driven by a shared set of learnable fingerprints:

*Figure: overview of the StreamFP framework.*

  1. Fingerprint-based Coreset Selection (FCS): Selects informative samples from incoming batches based on fingerprint similarity, prioritizing data that balances novelty and familiarity.
  2. Fingerprint-based Buffer Update (FBU): Dynamically maintains the replay buffer by preserving representative historical samples and discarding redundant ones.
  3. Fingerprint Attunement (FA): A lightweight plugin that uses pre-trained ViT attention to calibrate fingerprints online with negligible overhead.
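To make the FCS idea concrete, here is a minimal PyTorch sketch of fingerprint-guided selection. It is a hypothetical illustration, not the repository's implementation: the function name `select_coreset`, the median-distance criterion, and the tensor shapes are all assumptions chosen to show how cosine similarity between sample features and a small set of fingerprint vectors can rank an incoming batch.

```python
import torch

def select_coreset(features, fingerprints, k):
    """Hypothetical sketch of fingerprint-based coreset selection.

    Scores each incoming sample by its maximum cosine similarity to the
    fingerprints, then keeps the k samples whose scores sit closest to the
    batch median, i.e. neither redundant (too familiar) nor outliers
    (too novel). The actual FCS criterion is defined in the paper.
    """
    # Normalize so dot products become cosine similarities.
    f = torch.nn.functional.normalize(features, dim=1)
    p = torch.nn.functional.normalize(fingerprints, dim=1)
    sims = f @ p.T                      # (batch_size, num_fingerprints)
    scores = sims.max(dim=1).values     # familiarity score per sample
    # Keep the samples nearest the median familiarity score.
    dist = (scores - scores.median()).abs()
    idx = torch.topk(dist, k, largest=False).indices
    return idx

batch = torch.randn(32, 128)   # incoming feature batch (illustrative sizes)
fps = torch.randn(10, 128)     # fingerprints (an nn.Parameter in practice)
chosen = select_coreset(batch, fps, k=8)
print(chosen.shape)            # torch.Size([8])
```

Because only a similarity matrix against a handful of fingerprint vectors is computed, selection stays cheap relative to model-based scoring of every sample.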

πŸ› οΈ Installation

Prerequisites

  • Linux or macOS
  • Python 3.8+
  • PyTorch 1.12+ and CUDA 11.3+

Setup

```bash
# Clone the repository
git clone https://github.com/CGCL-codes/StreamFP.git
cd StreamFP

# Create and activate conda environment
conda env create -f environment.yml
conda activate sl

# (Optional) Install FastMoE (main path: build without NCCL)
# NOTE: FastMoE builds a CUDA extension. If you see errors like "nccl.h: No such file or directory",
# you can build without NCCL by setting USE_NCCL=0 (recommended unless you need NCCL-based distributed comm).
conda install -y cmake ninja

git clone --recursive https://github.com/laekov/fastmoe.git
cd fastmoe

# Option 1: build without NCCL (distributed features disabled)
USE_NCCL=0 python setup.py install

# Option 2: build with NCCL (distributed features enabled)
python setup.py install

# Quick check
python -c "import fmoe, fmoe_cuda; print('FastMoE installed:', fmoe_cuda.__file__)"
cd ..
```

πŸ“‚ Datasets

Create a data/ directory in the project root.

Download the datasets and extract them into the corresponding dataset folders under data/.

```bash
sh core50.sh
```

⚑ Quick Start

Basic Usage

To run a standard experiment, use the scripts provided in experiments/:

```bash
# Run Clear10 experiment
sh experiments/clear10.sh

# Run Clear100 experiment
sh experiments/clear100.sh

# Run Core50 experiment
sh experiments/core50.sh

# Run Stream-51 experiment
sh experiments/stream51.sh
```

Custom Configuration

You can customize the training by modifying the arguments in run.py. Key arguments include:

  • --selection_method: Strategy for coreset selection (e.g., StreamFP, Camel, Random).
  • --update_method: Strategy for buffer update (e.g., StreamFP, ER, GSS).
  • --skip_batch: Enable batch skipping for high-speed streams (default: 1).
    • 0: no skipping (process every batch)
    • k > 0: after processing one batch, skip the next k batches (reduces processing frequency)
  • --traintime_limit: Per-batch training time budget to simulate real-time constraints.
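The `--skip_batch` semantics above can be sketched as a simple stream loop. This is an illustrative sketch only: the function `run_stream`, its signature, and the `deadline` convention are assumptions, not `run.py`'s actual control flow.

```python
import time

def run_stream(batches, skip_batch=1, traintime_limit=10.0, train_step=None):
    """Illustrative sketch of batch skipping under a time budget.

    After training on one batch, the next `skip_batch` batches are dropped;
    each training step is also handed a wall-clock deadline so it can stop
    within `traintime_limit` seconds (simulating real-time constraints).
    """
    processed, to_skip = [], 0
    for i, batch in enumerate(batches):
        if to_skip > 0:
            to_skip -= 1          # high-speed stream: drop this batch
            continue
        start = time.time()
        if train_step is not None:
            train_step(batch, deadline=start + traintime_limit)
        processed.append(i)
        to_skip = skip_batch      # skip the next k batches
    return processed

# With skip_batch=1 (the default), every other batch is trained on.
print(run_stream(range(6), skip_batch=1))  # [0, 2, 4]
```

Setting `skip_batch=0` processes every batch, while larger values trade accuracy for throughput when the arrival rate exceeds the training rate.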

Example command:

```bash
python -u run.py --config configs/clear10.yaml \
  --repeat 1 --overwrite 1 \
  --selection_method StreamFP --update_method StreamFP \
  --mem_size 102 --traintime_limit 10
```

πŸ“Š Results

StreamFP consistently outperforms baselines in both Accuracy and Forgetting metrics. Below is a comparison on Stream-51 and Clear10 datasets:

| Dataset   | Method   | Accuracy (%) | Forgetting (%) | Runtime (s) |
|-----------|----------|--------------|----------------|-------------|
| Stream-51 | ER       | 59.99        | 3.70           | 1883.75     |
| Stream-51 | StreamFP | 64.44        | 2.25           | 2049.52     |
| Clear10   | ER       | 51.90        | 1.09           | 412.50      |
| Clear10   | StreamFP | 54.94        | 0.82           | 448.80      |

Detailed results can be found in the results_log/ directory after training.
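For readers unfamiliar with the forgetting column, here is a sketch of the standard average-forgetting metric from the continual-learning literature: the drop from each task's best past accuracy to its final accuracy, averaged over earlier tasks. This assumes the conventional definition; the paper's exact evaluation protocol may differ.

```python
def average_forgetting(acc):
    """acc[i][j]: accuracy on task j after training on task i (i >= j).

    Average forgetting is the mean, over all but the last task, of
    (best accuracy achieved on that task so far) minus (its final accuracy).
    Standard definition; illustrative only.
    """
    T = len(acc)
    drops = []
    for j in range(T - 1):
        best_past = max(acc[i][j] for i in range(j, T - 1))
        drops.append(best_past - acc[T - 1][j])
    return sum(drops) / len(drops)

# Toy accuracy matrix for three sequential tasks:
acc = [[0.80],
       [0.78, 0.70],
       [0.75, 0.72, 0.68]]
print(average_forgetting(acc))
```

Lower values mean later updates erased less of the knowledge acquired on earlier tasks, which is why StreamFP's smaller forgetting numbers matter alongside its accuracy gains.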

πŸ“œ Citation

If you find this work useful for your research, please cite our WWW '26 paper:

```bibtex
@inproceedings{li2026streamfp,
  title={StreamFP: Fingerprint-guided Data Selection for Efficient Stream Learning},
  author={Li, Changwu and Shi, Tongjun and Zhang, Shuhao and Chen, Binbin and He, Bingsheng and Liao, Xiaofei and Jin, Hai},
  booktitle={Proceedings of the ACM Web Conference 2026 (WWW '26)},
  year={2026},
  publisher={ACM},
  address={Dubai, United Arab Emirates},
  doi={10.1145/3774904.3792584}
}
```

πŸ“ License

This project is licensed under the MIT License - see the LICENSE file for details.

