Sources and import formats

Supported bibliographic sources, file formats, and API connectors.

Overview

biblioflow separates source from format:

source is the database or provider, such as scopus, wos, openalex, crossref, pubmed, or pmc.
format is the file format, such as csv, bibtex, ris, json, or plain_text.

For convenience, both can be inferred when they are not passed explicitly:

import biblioflow as bf

dataset = bf.load("records.ris")
dataset = bf.load("scopus_export.csv")
dataset = bf.load("savedrecs.txt")

When the source is known, pass it explicitly:

dataset = bf.load("savedrecs.txt", source="wos")
dataset = bf.load("scopus.csv", source="scopus")
dataset = bf.load("openalex.json", source="openalex")
dataset = bf.load("crossref.json", source="crossref")
dataset = bf.load("pubmed_records.nbib", source="pubmed")

The older provider= keyword remains supported for compatibility, but new examples should prefer source=.

Supported file inputs

Source	Formats	Typical files	Notes
Generic BibTeX	BibTeX	`.bib`, `.bibtex`	Common citation-manager and LaTeX format.
Generic RIS	RIS	`.ris`	Common reference-manager and database export format.
Scopus	CSV, BibTeX	`.csv`, `.bib`	CSV headers are mapped to the canonical schema.
Web of Science	Plain text, BibTeX	`savedrecs.txt`, `.bib`	Plain-text exports support repeated and multiline tags.
OpenAlex	JSON, CSV-like generic records	`.json`, `.csv`	JSON API response shapes with `results` are detected.
Crossref	JSON	`.json`	Work-list responses with `message.items` are detected.
PubMed	NBIB, XML	`.nbib`, `.xml`	PubMed/NLM files with PMID, DOI, MeSH, and journal fields.
PubMed Central	XML, API records	`.xml`, API results	PMC records can include PMCID, full-text URL, and full text when available.
Generic records	JSON, JSONL, CSV, TSV, YAML	`.json`, `.jsonl`, `.csv`, `.tsv`, `.yaml`
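
The `results` and `message.items` detection mentioned in the table can be pictured as a shape check on the decoded JSON. This is a minimal sketch only: detect_json_source is a hypothetical helper, not a biblioflow function, and the real detection logic may inspect more than the top-level keys.

```python
import json


def detect_json_source(payload):
    """Guess the provider from the top-level shape of a decoded JSON payload.

    Hypothetical helper for illustration; not part of biblioflow.
    """
    if isinstance(payload, dict):
        # OpenAlex list responses keep works under a top-level "results" list.
        if isinstance(payload.get("results"), list):
            return "openalex"
        # Crossref work-list responses nest works under message.items.
        message = payload.get("message")
        if isinstance(message, dict) and isinstance(message.get("items"), list):
            return "crossref"
    # Anything else is left for generic record handling.
    return None


openalex_like = json.loads('{"meta": {"count": 1}, "results": [{"id": "W1"}]}')
crossref_like = json.loads('{"status": "ok", "message": {"items": [{"DOI": "10.1000/x"}]}}')
generic_like = json.loads('[{"title": "plain record"}]')

print(detect_json_source(openalex_like))
print(detect_json_source(crossref_like))
print(detect_json_source(generic_like))
```

When a file's shape is ambiguous, passing source= explicitly avoids relying on detection.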

Canonical fields

Source-specific exports are normalized to one record schema. Common fields include:

source
source_id
pmid
pmcid
doi
title
abstract
full_text
authors
authors_raw
year
publication_year
publication_date
journal
source_title
document_type
language
volume
issue
pages
article_number
author_keywords
keywords_author
keywords_index
keywords_all
references
references_raw
full_text_url
affiliations
institutions
countries
cited_by_count
raw

The raw field preserves the source-specific payload inside each normalized record. The dataset also stores the raw records separately in dataset.raw when keep_raw=True.

Source aliases

Source aliases are normalized before dispatch:

Alias	Normalized source
`wos`, `webofscience`, `web-of-science`	`web_of_science`
`scopus`	`scopus`
`openalex`	`openalex`
`crossref`	`crossref`
`bib`, `bibtex`	`bibtex`
`ris`	`ris`
`pubmed`	`pubmed`
`pmc`, `pubmed_central`, `pubmed-central`, `pubmedcentral`, `pmcid`	`pmc`

Format aliases are also normalized. For example, bib becomes bibtex and txt becomes plain_text.
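
Alias normalization of this kind is usually a dictionary lookup. The sketch below mirrors the alias table above. The normalize_alias helper and both dictionaries are illustrative names, and the trimming and lower-casing step is an assumption rather than documented biblioflow behavior.

```python
# Alias tables copied from the documentation above; the names are illustrative.
SOURCE_ALIASES = {
    "wos": "web_of_science",
    "webofscience": "web_of_science",
    "web-of-science": "web_of_science",
    "bib": "bibtex",
    "pubmed_central": "pmc",
    "pubmed-central": "pmc",
    "pubmedcentral": "pmc",
    "pmcid": "pmc",
}

FORMAT_ALIASES = {"bib": "bibtex", "txt": "plain_text"}


def normalize_alias(name, aliases):
    """Return the normalized name, or the cleaned input when no alias matches."""
    # Assumption: matching ignores case and surrounding whitespace.
    key = name.strip().lower()
    return aliases.get(key, key)


print(normalize_alias("WoS", SOURCE_ALIASES))
print(normalize_alias("scopus", SOURCE_ALIASES))
print(normalize_alias("txt", FORMAT_ALIASES))
```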

Web of Science plain text

Web of Science plain-text exports are parsed directly:

dataset = bf.load("savedrecs.txt")

The parser handles:

ER-delimited records.
Multiline fields, such as long titles and abstracts.
Repeated fields, such as authors, addresses, emails, and references.
Core tags including AU, AF, TI, SO, DE, ID, AB, CR, TC, PY, DI, UT, WC, and SC.

Useful normalized fields include source_id, authors, source_title, keywords_author, keywords_index, references_raw, cited_by_count, wos_categories, and research_areas.
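
To make the parsing rules above concrete, here is a simplified sketch of a tagged-text reader. It is not biblioflow's parser. It assumes the usual export layout (a two-letter tag, a space, then the value, with continuation lines indented by three spaces), and the choice of which tags are list-valued is an assumption made for the example.

```python
# Assumption for this sketch: these tags hold one item per line.
LIST_TAGS = {"AU", "AF", "C1", "EM", "CR"}
# File-level tags that do not belong to any record.
HEADER_TAGS = {"FN", "VR", "EF"}


def finish(fields):
    """Keep list-valued tags as lists and join wrapped text with spaces."""
    return {t: (v if t in LIST_TAGS else " ".join(v)) for t, v in fields.items()}


def parse_wos(text):
    records, current, tag = [], {}, None
    for line in text.splitlines():
        if not line.strip():
            continue
        if line.startswith("   "):
            # Continuation line: belongs to the most recent tag, if any.
            if tag is not None:
                current[tag].append(line.strip())
            continue
        tag, _, value = line.partition(" ")
        if tag == "ER":
            # ER closes the current record.
            records.append(finish(current))
            current, tag = {}, None
        elif tag in HEADER_TAGS:
            tag = None
        else:
            # setdefault also covers a tag that is repeated within one record.
            current.setdefault(tag, []).append(value.strip())
    return records


SAMPLE = """FN Clarivate Analytics Web of Science
VR 1.0
PT J
AU Smith, J
   Doe, A
TI Mapping science with
   reproducible workflows
PY 2020
TC 12
ER

PT J
AU Lee, K
TI A second record
PY 2021
ER

EF
"""

records = parse_wos(SAMPLE)
print(len(records))
print(records[0]["AU"])
print(records[0]["TI"])
```

Mapping the parsed tags onto canonical fields (for example TC to cited_by_count) would be a separate step that this sketch does not cover.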

PubMed and PubMed Central API search

PubMed and PubMed Central searches use pymedx, which is installed with the core package. Use PubMed query syntax and provide an NCBI contact email either directly or through an environment variable:

export BIBLIOFLOW_NCBI_EMAIL="researcher@example.org"
# Optional, when you have an NCBI API key:
export BIBLIOFLOW_NCBI_API_KEY="..."

import biblioflow as bf

pubmed = bf.from_pubmed(
    query="bibliometrics AND reproducibility",
    limit=100,
)

pmc = bf.from_pmc(
    query="science mapping",
    limit=50,
)

from_pubmed_central(...) is the long-form name for from_pmc(...).

PubMed records populate fields such as pmid, doi, title, abstract, authors, source_title, publication_date, keywords_index, and url. PMC records also populate pmcid, full_text_url, open_access_url, and full_text when those values are returned by the source.

The loader intentionally requires a contact email before making API calls. Pass email="you@example.org" or set one of BIBLIOFLOW_NCBI_EMAIL, NCBI_EMAIL, or ENTREZ_EMAIL.
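
The email requirement can be pictured as a small lookup that runs before any request is made. This is a sketch under assumptions: resolve_ncbi_email is a hypothetical helper, the explicit argument is assumed to win, the environment variables are assumed to be checked in the order listed above, and ValueError stands in for whatever error biblioflow actually raises.

```python
import os

# Variable names taken from the documentation; the lookup order is an assumption.
EMAIL_ENV_VARS = ("BIBLIOFLOW_NCBI_EMAIL", "NCBI_EMAIL", "ENTREZ_EMAIL")


def resolve_ncbi_email(email=None, environ=None):
    """Return a contact email or fail before any API call is attempted."""
    environ = os.environ if environ is None else environ
    if email:
        return email
    for name in EMAIL_ENV_VARS:
        value = environ.get(name, "").strip()
        if value:
            return value
    # Placeholder error type; the library's own exception may differ.
    raise ValueError(
        "An NCBI contact email is required: pass email=... or set one of "
        + ", ".join(EMAIL_ENV_VARS)
    )


print(resolve_ncbi_email("you@example.org", {}))
print(resolve_ncbi_email(None, {"ENTREZ_EMAIL": "lab@example.org"}))
```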

Scopus CSV

Scopus CSV exports are loaded through the same API:

dataset = bf.load("scopus.csv")

Scopus-specific columns such as Authors, Author full names, Title, Year, Source title, Cited by, DOI, Affiliations, References, and EID are mapped to canonical fields. UTF-8 BOM files are supported.
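
As an illustration of header mapping with BOM handling, the following sketch reads a Scopus-style CSV using only the standard library. The COLUMN_MAP targets are assumptions chosen for the example (for instance, EID to source_id). biblioflow's actual mapping is not shown in this document.

```python
import codecs
import csv
import io

# Assumed header-to-field mapping, for illustration only.
COLUMN_MAP = {
    "Authors": "authors_raw",
    "Title": "title",
    "Year": "year",
    "Source title": "source_title",
    "Cited by": "cited_by_count",
    "DOI": "doi",
    "EID": "source_id",
}


def read_scopus_csv(stream):
    """Read a binary CSV stream and rename known columns."""
    # "utf-8-sig" drops a leading BOM so the first header is "Authors",
    # not "\ufeffAuthors".
    text = io.TextIOWrapper(stream, encoding="utf-8-sig", newline="")
    rows = []
    for row in csv.DictReader(text):
        rows.append({COLUMN_MAP[k]: v for k, v in row.items() if k in COLUMN_MAP})
    return rows


data = codecs.BOM_UTF8 + (
    "Authors,Title,Year,Source title,Cited by,DOI,EID\r\n"
    '"Smith J.; Doe A.",Mapping science,2020,Scientometrics,12,10.1000/example,2-s2.0-1\r\n'
).encode("utf-8")

records = read_scopus_csv(io.BytesIO(data))
print(records[0]["authors_raw"])
print(records[0]["source_id"])
```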

OpenAlex and Crossref JSON

Saved API responses can be loaded as files:

openalex = bf.load("openalex.json", source="openalex")
crossref = bf.load("crossref.json", source="crossref")

OpenAlex JSON supports a single work, a list of works, or a response object with results. OpenAlex abstract inverted indexes are reconstructed into plain text.
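
OpenAlex stores abstracts as an inverted index that maps each word to the positions where it occurs. Reconstruction amounts to placing every word at its positions and joining them in order. The function name below is illustrative, not a biblioflow API.

```python
def reconstruct_abstract(inverted_index):
    """Rebuild plain text from a {word: [positions]} inverted index."""
    if not inverted_index:
        return ""
    by_position = {}
    for word, positions in inverted_index.items():
        for position in positions:
            by_position[position] = word
    # Join the words in ascending position order.
    return " ".join(by_position[i] for i in sorted(by_position))


index = {
    "Science": [0],
    "mapping": [1],
    "is": [2, 6],
    "reproducible": [3],
    "when": [4],
    "data": [5],
    "shared": [7],
}

print(reconstruct_abstract(index))
```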

Crossref JSON supports a single work, a list of works, or a work-list response with message.items. Crossref date-parts are normalized into ISO-like dates.
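
Crossref dates arrive as nested date-parts lists that may carry only a year, or a year and month. One plausible reading of "ISO-like" is to keep whatever precision is present, as in this sketch. The date_parts_to_iso helper is hypothetical, and its output format is an assumption about the normalized form.

```python
def date_parts_to_iso(date_field):
    """Turn {"date-parts": [[2021, 3, 5]]} into "2021-03-05".

    Partial dates keep their precision ("2021-03", "2021").
    Missing or null dates return None.
    """
    parts = (date_field or {}).get("date-parts") or [[]]
    first = parts[0]
    if not first or first[0] is None:
        return None
    year, *rest = first
    # Zero-pad the month and day so the result sorts lexicographically.
    pieces = [f"{int(year):04d}"] + [f"{int(p):02d}" for p in rest[:2]]
    return "-".join(pieces)


print(date_parts_to_iso({"date-parts": [[2021, 3, 5]]}))
print(date_parts_to_iso({"date-parts": [[2021, 3]]}))
print(date_parts_to_iso({"date-parts": [[2021]]}))
```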

API connectors

File loading uses bf.load(...). API imports use dedicated functions:

records = bf.from_openalex(search="bibliometrics", limit=100)
records = bf.from_crossref(query="science mapping", limit=100)
pubmed = bf.from_pubmed(
    query="bibliometrics AND reproducibility",
    limit=100,
    email="researcher@example.org",
)
pmc = bf.from_pubmed_central(
    query="open science",
    limit=50,
    email="researcher@example.org",
)

Scopus API access requires optional dependencies and local pybliometrics configuration:

records = bf.from_scopus(
    query="TITLE-ABS-KEY(bibliometrics)",
    limit=100,
)

If Scopus API support is not installed or configured, biblioflow raises a structured configuration/dependency error.

Screening before analysis

The core library returns normalized datasets. Applications such as biblioflow-web and biblioflow-nb can stage those normalized records as source-agnostic screening runs before creating an analysis dataset. Use this intermediate step when a search or upload may contain irrelevant records, duplicates, or uncertain candidates:

import biblioflow_nb as bfn

app = bfn.app(display=False)
run = app.stage_file("records.ris", source="auto", format="auto")
app.update_candidates([run["candidates"][0]["candidate_id"]], status="selected")
app.promote_candidates()

In the web app, the Import or Load page creates screening runs from uploaded files and supported remote API sources. Candidate review happens on the dedicated Screening page. Direct bf.load(...) and bf.from_* calls remain available when no screening step is needed.