Sources and import formats
Overview
biblioflow separates source from format:
- source is the database or provider, such as scopus, wos, openalex, crossref, pubmed, or pmc.
- format is the file format, such as csv, bibtex, ris, json, or plain_text.
For convenience, both can be inferred:
import biblioflow as bf
dataset = bf.load("records.ris")
dataset = bf.load("scopus_export.csv")
dataset = bf.load("savedrecs.txt")

When the source is known, pass it explicitly:
dataset = bf.load("savedrecs.txt", source="wos")
dataset = bf.load("scopus.csv", source="scopus")
dataset = bf.load("openalex.json", source="openalex")
dataset = bf.load("crossref.json", source="crossref")
dataset = bf.load("pubmed_records.nbib", source="pubmed")

The older provider= keyword remains supported for compatibility, but new examples should prefer source=.
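As a rough illustration of the inference and keyword compatibility described above, the sketch below guesses a format from the file extension and treats the legacy provider= keyword as a fallback for source=. The function name resolve_load_options and the extension table are hypothetical and do not describe biblioflow's internals; only the format names are taken from this page.

```python
from pathlib import Path

# Hypothetical extension table; the format names are the ones used on this page.
EXTENSION_FORMATS = {
    ".csv": "csv",
    ".bib": "bibtex",
    ".bibtex": "bibtex",
    ".ris": "ris",
    ".json": "json",
    ".txt": "plain_text",
}


def resolve_load_options(path, source=None, format=None, provider=None):
    """Return (source, format), filling in whatever the caller omitted."""
    # Prefer source=; fall back to the legacy provider= keyword.
    if source is None:
        source = provider
    # Guess the format from the extension only when it was not given.
    if format is None:
        format = EXTENSION_FORMATS.get(Path(path).suffix.lower())
    return source, format


print(resolve_load_options("records.ris"))
print(resolve_load_options("savedrecs.txt", provider="wos"))
```

An unknown extension yields None for the format in this sketch, which leaves room for content-based detection as a later step.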
Supported file inputs
| Source | Formats | Typical files | Notes |
|---|---|---|---|
| Generic BibTeX | BibTeX | .bib, .bibtex | Common citation-manager and LaTeX format. |
| Generic RIS | RIS | .ris | Common reference-manager and database export format. |
| Scopus | CSV, BibTeX | .csv, .bib | CSV headers are mapped to the canonical schema. |
| Web of Science | Plain text, BibTeX | savedrecs.txt, .bib | Plain-text exports support repeated and multiline tags. |
| OpenAlex | JSON, CSV-like generic records | .json, .csv | JSON API response shapes with results are detected. |
| Crossref | JSON | .json | Work-list responses with message.items are detected. |
| PubMed | NBIB, XML | .nbib, .xml | PubMed/NLM files with PMID, DOI, MeSH, and journal fields. |
| PubMed Central | XML, API records | .xml, API results | PMC records can include PMCID, full-text URL, and full text when available. |
| Generic records | JSON, JSONL, CSV, TSV, YAML | .json, .jsonl, .csv, .tsv, .yaml | |
Canonical fields
Source-specific exports are normalized to one record schema. Common fields include:
source
source_id
pmid
pmcid
doi
title
abstract
full_text
authors
authors_raw
year
publication_year
publication_date
journal
source_title
document_type
language
volume
issue
pages
article_number
author_keywords
keywords_author
keywords_index
keywords_all
references
references_raw
full_text_url
affiliations
institutions
countries
cited_by_count
raw
The raw field preserves the source-specific payload inside each normalized record. The dataset also stores the raw records separately in dataset.raw when keep_raw=True.
Source aliases
source aliases are normalized before dispatch:
| Alias | Normalized source |
|---|---|
| wos, webofscience, web-of-science | web_of_science |
| scopus | scopus |
| openalex | openalex |
| crossref | crossref |
| bib, bibtex | bibtex |
| ris | ris |
| pubmed | pubmed |
| pmc, pubmed_central, pubmed-central, pubmedcentral, pmcid | pmc |
Format aliases are also normalized. For example, bib becomes bibtex and txt becomes plain_text.
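Alias normalization of this kind usually comes down to a dictionary lookup. The sketch below copies the alias pairs listed above into two tables. The function name normalize_name is hypothetical, and trimming whitespace and lowercasing the input are assumptions rather than documented biblioflow behavior.

```python
# Alias pairs copied from this page; names that map to themselves are omitted.
SOURCE_ALIASES = {
    "wos": "web_of_science",
    "webofscience": "web_of_science",
    "web-of-science": "web_of_science",
    "bib": "bibtex",
    "pubmed_central": "pmc",
    "pubmed-central": "pmc",
    "pubmedcentral": "pmc",
    "pmcid": "pmc",
}
FORMAT_ALIASES = {"bib": "bibtex", "txt": "plain_text"}


def normalize_name(name, aliases):
    """Map an alias to its normalized name; unknown names pass through."""
    key = name.strip().lower()  # assumption: lookups ignore case and padding
    return aliases.get(key, key)


print(normalize_name("web-of-science", SOURCE_ALIASES))
print(normalize_name("txt", FORMAT_ALIASES))
```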
Web of Science plain text
Web of Science plain-text exports are parsed directly:
dataset = bf.load("savedrecs.txt")

The parser handles:
- ER-delimited records.
- Multiline fields, such as long titles and abstracts.
- Repeated fields, such as authors, addresses, emails, and references.
- Core tags including AU, AF, TI, SO, DE, ID, AB, CR, TC, PY, DI, UT, WC, and SC.
Useful normalized fields include source_id, authors, source_title, keywords_author, keywords_index, references_raw, cited_by_count, wos_categories, and research_areas.
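To make the tagged layout concrete, here is a minimal standalone sketch of a parser for this kind of export. It is not biblioflow's parser. The names parse_wos_text, MULTIVALUE_TAGS, and FILE_LEVEL_TAGS are hypothetical, and the choice of which tags keep one value per line is an assumption. The sample record is made up.

```python
# Assumption: these tags hold one value per line; all others are joined.
MULTIVALUE_TAGS = {"AU", "AF", "C1", "EM", "CR"}
# File-level tags that do not belong to any record.
FILE_LEVEL_TAGS = {"FN", "VR", "EF"}


def _is_tag_line(line):
    """True when the line starts with a two-character tag such as AU or C1."""
    head = line[:2]
    return (len(head) == 2 and head.isalnum() and head.isupper()
            and line[2:3] in ("", " "))


def _finish(fields):
    """Join wrapped text fields; keep multi-value fields as lists."""
    return {
        tag: values if tag in MULTIVALUE_TAGS else " ".join(values)
        for tag, values in fields.items()
    }


def parse_wos_text(text):
    """Parse tagged plain text into a list of {tag: value} records."""
    records, current, tag = [], {}, None
    for line in text.splitlines():
        if not line.strip():
            continue
        if _is_tag_line(line):
            tag = line[:2]
            if tag == "ER":
                # ER closes the current record.
                records.append(_finish(current))
                current, tag = {}, None
            elif tag in FILE_LEVEL_TAGS:
                tag = None
            else:
                # A repeated tag appends to the same list.
                current.setdefault(tag, []).append(line[3:].strip())
        elif tag is not None:
            # Indented continuation of the previous tag.
            current[tag].append(line.strip())
    return records


SAMPLE = "\n".join([
    "FN Clarivate Analytics Web of Science",
    "VR 1.0",
    "PT J",
    "AU Smith, J",
    "   Doe, A",
    "TI A long title that wraps",
    "   onto a second line",
    "PY 2021",
    "ER",
    "",
    "PT J",
    "AU Lee, K",
    "TI Second record",
    "ER",
    "",
    "EF",
])
records = parse_wos_text(SAMPLE)
print(records[0]["AU"], records[0]["TI"])
```

Keeping every value as a list until the record closes lets one code path handle both wrapped text and repeated values; only the final join differs by tag.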
PubMed and PubMed Central API search
PubMed and PubMed Central searches use pymedx, which is installed with the core package. Use PubMed query syntax and provide an NCBI contact email either directly or through an environment variable:
export BIBLIOFLOW_NCBI_EMAIL="researcher@example.org"
# Optional, when you have an NCBI API key:
export BIBLIOFLOW_NCBI_API_KEY="..."

import biblioflow as bf
pubmed = bf.from_pubmed(
    query="bibliometrics AND reproducibility",
    limit=100,
)
pmc = bf.from_pmc(
    query="science mapping",
    limit=50,
)

from_pubmed_central(...) is the long-form name for from_pmc(...). PubMed records populate fields such as pmid, doi, title, abstract, authors, source_title, publication_date, keywords_index, and url. PMC records also populate pmcid, full_text_url, open_access_url, and full_text when those values are returned by the source.
The loader intentionally requires a contact email before making API calls. Pass email="you@example.org" or set one of BIBLIOFLOW_NCBI_EMAIL, NCBI_EMAIL, or ENTREZ_EMAIL.
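The lookup described above can be pictured with a small sketch. The environment variable names come from this page. The function name resolve_ncbi_email, the order in which the variables are checked, and the use of ValueError are assumptions made for illustration.

```python
import os

# Variable names come from this page; the lookup order is an assumption.
EMAIL_ENV_VARS = ("BIBLIOFLOW_NCBI_EMAIL", "NCBI_EMAIL", "ENTREZ_EMAIL")


def resolve_ncbi_email(email=None, environ=None):
    """Return the contact email, or raise if none is configured."""
    environ = os.environ if environ is None else environ
    # An explicit argument wins over the environment.
    if email:
        return email
    for name in EMAIL_ENV_VARS:
        value = environ.get(name, "").strip()
        if value:
            return value
    # Fail before any network request is attempted.
    raise ValueError(
        "An NCBI contact email is required: pass email=... or set one of "
        + ", ".join(EMAIL_ENV_VARS)
    )


print(resolve_ncbi_email(environ={"NCBI_EMAIL": "researcher@example.org"}))
```

Passing the environment as a parameter keeps the function easy to test without touching the real process environment.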
Scopus CSV
Scopus CSV exports are loaded through the same API:
dataset = bf.load("scopus.csv")

Scopus-specific columns such as Authors, Author full names, Title, Year, Source title, Cited by, DOI, Affiliations, References, and EID are mapped to canonical fields. UTF-8 BOM files are supported.
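The following sketch shows one way a header mapping with BOM handling could work, using only the standard library. The exact pairing of Scopus columns with canonical fields is a guess (for example Authors to authors_raw and EID to source_id), the function name read_scopus_csv is hypothetical, and the sample row is made up.

```python
import csv
import io

# Hypothetical mapping from Scopus column headers to canonical field names.
SCOPUS_COLUMNS = {
    "Authors": "authors_raw",
    "Title": "title",
    "Year": "year",
    "Source title": "source_title",
    "Cited by": "cited_by_count",
    "DOI": "doi",
    "EID": "source_id",
}


def read_scopus_csv(data: bytes):
    """Decode a CSV export and rename the columns listed in SCOPUS_COLUMNS."""
    # "utf-8-sig" drops a leading byte-order mark when one is present,
    # so the first header is read as "Authors" rather than "\ufeffAuthors".
    text = data.decode("utf-8-sig")
    records = []
    for row in csv.DictReader(io.StringIO(text, newline="")):
        record = {
            SCOPUS_COLUMNS[name]: value
            for name, value in row.items()
            if name in SCOPUS_COLUMNS
        }
        # Keep the untouched row alongside the renamed fields.
        record["raw"] = dict(row)
        records.append(record)
    return records


sample = (
    "\ufeffAuthors,Title,Year,DOI\r\n"
    '"Smith J.; Doe A.",An example title,2021,10.1234/example\r\n'
).encode("utf-8")
records = read_scopus_csv(sample)
print(records[0]["title"], records[0]["year"])
```

Values stay as strings in this sketch; converting Year or Cited by to integers would be a separate normalization step.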
OpenAlex and Crossref JSON
Saved API responses can be loaded as files:
openalex = bf.load("openalex.json", source="openalex")
crossref = bf.load("crossref.json", source="crossref")

OpenAlex JSON supports a single work, a list of works, or a response object with results. OpenAlex abstract inverted indexes are reconstructed into plain text.
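An OpenAlex inverted index maps each word to the positions where it occurs, so rebuilding the text means sorting words by position. The sketch below illustrates that idea together with the three payload shapes mentioned above. The function names extract_works and reconstruct_abstract are hypothetical and are not part of biblioflow's documented API.

```python
def extract_works(payload):
    """Return a list of works from a single work, a list, or a results object."""
    if isinstance(payload, list):
        return payload
    if isinstance(payload, dict) and isinstance(payload.get("results"), list):
        return payload["results"]
    # Anything else is treated as one work.
    return [payload]


def reconstruct_abstract(inverted_index):
    """Rebuild plain text from a {word: [positions]} mapping."""
    if not inverted_index:
        return ""
    positioned = [
        (position, word)
        for word, positions in inverted_index.items()
        for position in positions
    ]
    # Sorting by position restores the original word order.
    return " ".join(word for _, word in sorted(positioned))


index = {"Science": [0], "mapping": [1, 4], "and": [2], "more": [3]}
print(reconstruct_abstract(index))
```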
Crossref JSON supports a single work, a list of works, or a work-list response with message.items. Crossref date-parts are normalized into ISO-like dates.
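Crossref dates arrive as nested lists such as {"date-parts": [[2021, 3, 15]]}, and the month or day may be missing. That is one reason the result is only ISO-like. The sketch below shows a possible normalization; the function name date_parts_to_iso and the choice to return None for empty dates are assumptions.

```python
def date_parts_to_iso(date_field):
    """Turn {"date-parts": [[2021, 3, 15]]} into "2021-03-15".

    Partial dates give "2021-03" or "2021"; empty dates give None.
    """
    parts = (date_field or {}).get("date-parts") or [[]]
    values = []
    for part in parts[0][:3]:
        if part is None:
            # Stop at the first missing component.
            break
        values.append(int(part))
    if not values:
        return None
    year, rest = values[0], values[1:]
    return "-".join([f"{year:04d}"] + [f"{value:02d}" for value in rest])


print(date_parts_to_iso({"date-parts": [[2021, 3, 15]]}))
print(date_parts_to_iso({"date-parts": [[2021]]}))
```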
API connectors
File loading uses bf.load(...). API imports use dedicated functions:
records = bf.from_openalex(search="bibliometrics", limit=100)
records = bf.from_crossref(query="science mapping", limit=100)
pubmed = bf.from_pubmed(
    query="bibliometrics AND reproducibility",
    limit=100,
    email="researcher@example.org",
)
pmc = bf.from_pubmed_central(
    query="open science",
    limit=50,
    email="researcher@example.org",
)

Scopus API access requires optional dependencies and local pybliometrics configuration:
records = bf.from_scopus(
    query="TITLE-ABS-KEY(bibliometrics)",
    limit=100,
)

If Scopus API support is not installed or configured, biblioflow raises a structured configuration/dependency error.
Screening before analysis
The core library returns normalized datasets. Applications such as biblioflow-web and biblioflow-nb can stage those normalized records as source-agnostic screening runs before creating an analysis dataset. Use this intermediate step when a search or upload may contain irrelevant records, duplicates, or uncertain candidates:
import biblioflow_nb as bfn
app = bfn.app(display=False)
run = app.stage_file("records.ris", source="auto", format="auto")
app.update_candidates([run["candidates"][0]["candidate_id"]], status="selected")
app.promote_candidates()

In the web app, the Import or Load page creates screening runs from uploaded files and supported remote API sources. Candidate review happens on the dedicated Screening page. Direct bf.load(...) and bf.from_* calls remain available when no screening step is needed.