Core workflows
Workflow overview
A typical biblioflow pipeline looks like this:
- Load bibliographic records from a file or existing data structure.
- Normalize provider-specific fields into the canonical schema.
- Inspect metadata, warnings, and validation output.
- Optionally deduplicate, enrich, or filter the dataset.
- Run descriptive analysis, matrix construction, network construction, or science-mapping helpers.
- Export datasets or derived results.
```python
import biblioflow as bf

dataset = bf.load("records.bib", source="auto", format="auto")
dataset = bf.deduplicate(dataset)
summary = bf.summarize_dataset(dataset)
analysis = bf.analyze(dataset)
net = bf.network(dataset, kind="co_occurrence", unit="keywords_all")
bf.export(net, "network.gexf", format="gexf")
```

Canonical dataset
BibliographicDataset stores the normalized table plus context:
| Attribute | Purpose |
|---|---|
| data | DataFrame-like normalized records. Uses pandas when available, otherwise a lightweight fallback. |
| raw | Raw records before normalization for traceability. |
| metadata | Format, provider, load time, record counts, and workflow metadata. |
| warnings | Structured load warnings. |
| errors | Non-fatal errors collected during loading or validation. |
Each canonical record also has a raw field when keep_raw=True, which makes it possible to inspect the original source payload beside normalized fields.
Frequently used normalized columns include:
- source
- source_id
- pmid
- pmcid
- doi
- title
- abstract
- full_text
- authors
- authors_raw
- publication_year
- publication_date
- source_title
- journal
- keywords_author
- keywords_index
- keywords_all
- references
- references_raw
- full_text_url
- affiliations
- institutions
- countries
- cited_by_count
- raw
Useful methods:
```python
dataset.to_records()
dataset.to_dataframe(schema="canonical")
dataset.to_dataframe(schema="bibliometrix")
dataset.warning_dicts()
dataset.to_json("records.json")
dataset.to_csv("records.csv")
```

Summaries and validation
Use reusable summary helpers when an application needs small JSON payloads:
```python
import biblioflow as bf

summary = bf.summarize_dataset(dataset)
import_summary = bf.summarize_import(dataset)
summary.to_dict()
import_summary.to_dict()
```

The CLI validate command exposes warnings and errors as JSON:

```shell
biblioflow validate records.ris
```

Deduplication and enrichment

```python
deduped = bf.deduplicate(dataset)
```

deduplicate() is intended for conservative local deduplication. Local enrichment helpers should also live in the core package so every interface can reuse the same behavior.
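The text above does not say how deduplicate() decides that two records match. As a rough illustration of what "conservative" deduplication can mean, the sketch below merges records only when they share a normalized DOI or, when no DOI is present, an identical normalized title and year. The functions `normalize_doi`, `normalize_title`, and `dedupe_records` are hypothetical names for this example and are not part of biblioflow.

```python
import re


def normalize_doi(doi):
    """Lower-case a DOI and strip common URL or 'doi:' prefixes."""
    if not doi:
        return None
    doi = doi.strip().lower()
    doi = re.sub(r"^(https?://(dx\.)?doi\.org/|doi:)", "", doi).strip()
    return doi or None


def normalize_title(title):
    """Collapse a title to lower-case alphanumeric words."""
    if not title:
        return None
    words = re.findall(r"[a-z0-9]+", title.lower())
    return " ".join(words) or None


def dedupe_records(records):
    """Keep the first record per DOI, or per (title, year) when no DOI exists."""
    seen = set()
    kept = []
    for record in records:
        doi = normalize_doi(record.get("doi"))
        title = normalize_title(record.get("title"))
        year = record.get("publication_year")
        if doi:
            key = ("doi", doi)
        elif title and year:
            key = ("title_year", title, year)
        else:
            key = None  # too little information: never merge (assumption)
        if key is not None and key in seen:
            continue
        if key is not None:
            seen.add(key)
        kept.append(record)
    return kept


records = [
    {"doi": "10.1000/ABC", "title": "Science Mapping", "publication_year": 2021},
    {"doi": "https://doi.org/10.1000/abc", "title": "Science mapping.", "publication_year": 2021},
    {"doi": None, "title": "Open Science!", "publication_year": 2020},
    {"doi": None, "title": "open science", "publication_year": 2020},
    {"doi": None, "title": None, "publication_year": 2020},
]
unique = dedupe_records(records)
print(len(unique))
```

Records that lack both a DOI and a usable title are always kept here, which errs on the side of retaining duplicates over dropping distinct records.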
Source-specific loading
Use source= when an export is known:
```python
scopus = bf.load("scopus.csv", source="scopus")
wos = bf.load("savedrecs.txt", source="wos")
openalex = bf.load("openalex.json", source="openalex")
crossref = bf.load("crossref.json", source="crossref")
pubmed = bf.load("records.nbib", source="pubmed")
```

The loader detects many common cases automatically, but explicit source= is recommended for ambiguous .csv, .txt, and .json files.
Dedicated API helpers are available for live metadata retrieval:
```python
openalex = bf.from_openalex(search="bibliometrics", limit=100)
crossref = bf.from_crossref(query="science mapping", limit=100)
pubmed = bf.from_pubmed(
    query="bibliometrics AND reproducibility",
    limit=100,
    email="researcher@example.org",
)
pmc = bf.from_pmc(
    query="open science",
    limit=50,
    email="researcher@example.org",
)
```

PubMed and PubMed Central datasets are regular BibliographicDataset objects, so the same filtering, analysis, network, mapping, and export calls apply. PMC records can include pmcid, full_text_url, and full_text in addition to standard bibliographic metadata.
For a full mapping of supported providers and formats, see Sources and import formats.
Filters
Filters are represented by DatasetFilterSpec, which is serializable and shared by the core library, web backend, and notebook app.
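To show why a serializable specification is convenient to share between interfaces, here is a minimal standalone sketch. `SimpleFilterSpec` is a hypothetical stand-in, not the real DatasetFilterSpec, and its matching rules are assumptions: a record passes the keyword filter when it shares at least one keyword, compared case-insensitively, and citations are read from cited_by_count.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class SimpleFilterSpec:
    """Hypothetical stand-in for a serializable filter specification."""

    year_min: Optional[int] = None
    year_max: Optional[int] = None
    keywords: List[str] = field(default_factory=list)
    min_citations: int = 0

    def matches(self, record: dict) -> bool:
        year = record.get("publication_year")
        if self.year_min is not None and (year is None or year < self.year_min):
            return False
        if self.year_max is not None and (year is None or year > self.year_max):
            return False
        if (record.get("cited_by_count") or 0) < self.min_citations:
            return False
        if self.keywords:
            wanted = {k.lower() for k in self.keywords}
            present = {k.lower() for k in record.get("keywords_all") or []}
            # Assumption: one shared keyword is enough to pass.
            if not wanted & present:
                return False
        return True


simple_spec = SimpleFilterSpec(
    year_min=2020, year_max=2026, keywords=["bibliometrics"], min_citations=5
)

# Round-trip through JSON, as a web backend or notebook might do.
payload = json.dumps(asdict(simple_spec))
restored = SimpleFilterSpec(**json.loads(payload))

records = [
    {"publication_year": 2021, "cited_by_count": 12, "keywords_all": ["Bibliometrics"]},
    {"publication_year": 2019, "cited_by_count": 40, "keywords_all": ["bibliometrics"]},
    {"publication_year": 2022, "cited_by_count": 2, "keywords_all": ["bibliometrics"]},
    {"publication_year": 2023, "cited_by_count": 9, "keywords_all": ["networks"]},
]
kept = [r for r in records if restored.matches(r)]
print(len(kept))
```

Because the spec is plain data, the same JSON payload can be stored, sent over HTTP, or replayed later without re-creating the filter by hand.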
```python
spec = bf.DatasetFilterSpec(
    year_min=2020,
    year_max=2026,
    keywords=["bibliometrics", "open science"],
    min_global_citations=5,
)
filtered = bf.filter_dataset(dataset, spec)
filtered.dataset.to_records()
filtered.to_dict()
```

Available filter values can be derived from any dataset:
```python
options = bf.available_filter_values(dataset)
options.to_dict()
```

Descriptive analysis
```python
analysis = bf.analyze(dataset, top_n=20)
payload = analysis.to_dict()
```

The payload contains:

- main_information
- annual_production
- top_authors
- top_sources
- top_keywords
- metadata
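The exact structure of each payload entry is not specified here. To give a feel for what entries such as annual_production and top_keywords summarize, the following sketch computes comparable counts from plain record dictionaries with the standard library. It is an illustration under assumed record shapes, not the biblioflow implementation.

```python
from collections import Counter

records = [
    {"publication_year": 2021, "keywords_all": ["bibliometrics", "open science"]},
    {"publication_year": 2021, "keywords_all": ["bibliometrics"]},
    {"publication_year": 2023, "keywords_all": ["science mapping", "bibliometrics"]},
    {"publication_year": None, "keywords_all": []},
]

# Documents per year, skipping records without a year, in chronological order.
year_counts = Counter(
    r["publication_year"] for r in records if r["publication_year"] is not None
)
annual_production = sorted(year_counts.items())

# Count each keyword once per record, then keep the most frequent ones.
keyword_counts = Counter(k for r in records for k in set(r["keywords_all"]))
top_n = 2
top_keywords = keyword_counts.most_common(top_n)

print(annual_production)
print(top_keywords[0])
```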
Matrix construction
bf.matrix() builds incidence, co-occurrence, collaboration, co-citation, bibliographic coupling, and direct-citation matrices depending on kind and unit.
```python
mat = bf.matrix(
    dataset,
    kind="co_occurrence",
    unit="keywords_all",
    normalize="association",
    min_occurrences=2,
)
```

Common units include authors, keywords, sources, countries, affiliations, and reference identifiers when present in the dataset.
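For readers unfamiliar with the terms, the sketch below builds a small keyword co-occurrence table by hand and applies association-strength normalization, taken here as co-occurrences divided by the product of the two items' occurrence counts. Whether normalize="association" in bf.matrix() uses exactly this formula is an assumption, and the variable names are illustrative only.

```python
from collections import Counter
from itertools import combinations

documents = [
    ["bibliometrics", "open science"],
    ["bibliometrics", "science mapping"],
    ["bibliometrics", "open science", "science mapping"],
    ["peer review"],
]
min_occurrences = 2

# Occurrences: number of documents that mention each keyword.
occurrences = Counter(k for doc in documents for k in set(doc))
kept = sorted(k for k, n in occurrences.items() if n >= min_occurrences)

# Co-occurrences: number of documents that mention both keywords of a pair.
cooccurrence = Counter()
for doc in documents:
    present = sorted(set(doc) & set(kept))
    for a, b in combinations(present, 2):
        cooccurrence[(a, b)] += 1

# Association strength: co-occurrences divided by the product of occurrences.
association = {
    pair: count / (occurrences[pair[0]] * occurrences[pair[1]])
    for pair, count in cooccurrence.items()
}

for pair, value in sorted(association.items()):
    print(pair, round(value, 3))
```

The normalization discounts pairs that co-occur often simply because both keywords are frequent, so a rarer pair can score as high as a pair involving a very common keyword.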
Network construction
bf.network() converts matrix-like relationships into node/edge records with basic metrics.
```python
net = bf.network(
    dataset,
    kind="co_occurrence",
    unit="keywords_all",
    normalize="association",
    min_occurrences=2,
)
net.to_dict()
```

Mapping helpers
Lightweight science-mapping helpers are available from the main namespace:
```python
themes = bf.map_themes(dataset, field="keywords_all")
evolution = bf.trace_themes(dataset, field="keywords_all", by="publication_year")
concepts = bf.conceptual_structure(dataset, field="keywords_all")
history = bf.historiograph(dataset)
```

These helpers are intentionally conservative. More advanced clustering and layout techniques should be implemented in the core package before being wired into the applications.
Export formats
| Object | Common formats |
|---|---|
| Dataset | JSON, CSV, optional YAML |
| Matrix | JSON, CSV |
| Network | JSON, GraphML, GEXF, Pajek, VOSviewer text |

```python
bf.export(dataset, "records.json", format="json")
bf.export(mat, "matrix.csv", format="csv")
bf.export(net, "network.graphml", format="graphml")
```
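For context on what a GraphML network file contains, this standalone sketch serializes a few nodes and weighted edges with the standard library and parses the result back. It does not show how bf.export() writes GraphML. The function `to_graphml` and the sample data are hypothetical.

```python
import xml.etree.ElementTree as ET

GRAPHML_NS = "http://graphml.graphdrawing.org/xmlns"


def to_graphml(nodes, edges):
    """Serialize node ids and weighted undirected edges as a GraphML string."""
    root = ET.Element("graphml", xmlns=GRAPHML_NS)
    # Declare the edge attribute before it is used in <data> elements.
    ET.SubElement(
        root,
        "key",
        {"id": "weight", "for": "edge", "attr.name": "weight", "attr.type": "double"},
    )
    graph = ET.SubElement(root, "graph", edgedefault="undirected")
    for node_id in nodes:
        ET.SubElement(graph, "node", id=node_id)
    for source, target, weight in edges:
        edge = ET.SubElement(graph, "edge", source=source, target=target)
        ET.SubElement(edge, "data", key="weight").text = str(weight)
    return ET.tostring(root, encoding="unicode")


nodes = ["bibliometrics", "open science", "science mapping"]
edges = [
    ("bibliometrics", "open science", 2.0),
    ("bibliometrics", "science mapping", 2.0),
]
graphml = to_graphml(nodes, edges)

# Parse the string back to confirm it is well-formed XML in the GraphML namespace.
parsed = ET.fromstring(graphml)
node_count = len(parsed.findall(f".//{{{GRAPHML_NS}}}node"))
edge_count = len(parsed.findall(f".//{{{GRAPHML_NS}}}edge"))
print(node_count, edge_count)
```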