Sources and import formats

Supported bibliographic sources, file formats, and API connectors.

Overview

biblioflow separates source from format:

  • source is the database or provider, such as scopus, wos, openalex, crossref, pubmed, or pmc.
  • format is the file format, such as csv, bibtex, ris, json, or plain_text.

For convenience, bf.load(...) can infer both when neither is given:

import biblioflow as bf

dataset = bf.load("records.ris")
dataset = bf.load("scopus_export.csv")
dataset = bf.load("savedrecs.txt")

When the source is known, pass it explicitly:

dataset = bf.load("savedrecs.txt", source="wos")
dataset = bf.load("scopus.csv", source="scopus")
dataset = bf.load("openalex.json", source="openalex")
dataset = bf.load("crossref.json", source="crossref")
dataset = bf.load("pubmed_records.nbib", source="pubmed")

The older provider= keyword remains supported for backward compatibility, but new code should use source=.

Supported file inputs

| Source | Formats | Typical files | Notes |
|---|---|---|---|
| Generic BibTeX | BibTeX | .bib, .bibtex | Common citation-manager and LaTeX format. |
| Generic RIS | RIS | .ris | Common reference-manager and database export format. |
| Scopus | CSV, BibTeX | .csv, .bib | CSV headers are mapped to the canonical schema. |
| Web of Science | Plain text, BibTeX | savedrecs.txt, .bib | Plain-text exports support repeated and multiline tags. |
| OpenAlex | JSON, CSV-like generic records | .json, .csv | JSON API response shapes with results are detected. |
| Crossref | JSON | .json | Work-list responses with message.items are detected. |
| PubMed | NBIB, XML | .nbib, .xml | PubMed/NLM files with PMID, DOI, MeSH, and journal fields. |
| PubMed Central | XML, API records | .xml, API results | PMC records can include PMCID, full-text URL, and full text when available. |
| Generic records | JSON, JSONL, CSV, TSV, YAML | .json, .jsonl, .csv, .tsv, .yaml | |

Canonical fields

Source-specific exports are normalized to one record schema. Common fields include:

source
source_id
pmid
pmcid
doi
title
abstract
full_text
authors
authors_raw
year
publication_year
publication_date
journal
source_title
document_type
language
volume
issue
pages
article_number
author_keywords
keywords_author
keywords_index
keywords_all
references
references_raw
full_text_url
affiliations
institutions
countries
cited_by_count
raw

The raw field preserves the source-specific payload inside each normalized record. The dataset also stores the raw records separately in dataset.raw when keep_raw=True.

Source aliases

Source aliases are normalized before dispatch:

| Alias | Normalized source |
|---|---|
| wos, webofscience, web-of-science | web_of_science |
| scopus | scopus |
| openalex | openalex |
| crossref | crossref |
| bib, bibtex | bibtex |
| ris | ris |
| pubmed | pubmed |
| pmc, pubmed_central, pubmed-central, pubmedcentral, pmcid | pmc |

Format aliases are also normalized. For example, bib becomes bibtex and txt becomes plain_text.
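
The alias handling described here can be pictured as a dictionary lookup. The following is a minimal standalone sketch, not biblioflow's internal code: the names SOURCE_ALIASES, FORMAT_ALIASES, and normalize_name are hypothetical, and the case-insensitive matching is an assumption.

```python
# Hypothetical alias tables built from the documented aliases.
# These are illustrative and are not biblioflow's internal data structures.
SOURCE_ALIASES = {
    "wos": "web_of_science",
    "webofscience": "web_of_science",
    "web-of-science": "web_of_science",
    "bib": "bibtex",
    "pubmed_central": "pmc",
    "pubmed-central": "pmc",
    "pubmedcentral": "pmc",
    "pmcid": "pmc",
}
FORMAT_ALIASES = {"bib": "bibtex", "txt": "plain_text"}


def normalize_name(name: str, aliases: dict[str, str]) -> str:
    """Trim and lower-case a name, then map it through an alias table."""
    key = name.strip().lower()  # case-insensitive matching is an assumption
    # Names that are already canonical (scopus, ris, pmc, ...) pass through.
    return aliases.get(key, key)


print(normalize_name("WoS", SOURCE_ALIASES))  # web_of_science
print(normalize_name("txt", FORMAT_ALIASES))  # plain_text
```

Keeping source and format aliases in separate tables mirrors the source/format split described in the overview.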

Web of Science plain text

Web of Science plain-text exports are parsed directly:

dataset = bf.load("savedrecs.txt")

The parser handles:

  • ER-delimited records.
  • Multiline fields, such as long titles and abstracts.
  • Repeated fields, such as authors, addresses, emails, and references.
  • Core tags including AU, AF, TI, SO, DE, ID, AB, CR, TC, PY, DI, UT, WC, and SC.

Useful normalized fields include source_id, authors, source_title, keywords_author, keywords_index, references_raw, cited_by_count, wos_categories, and research_areas.
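
To make the parsing rules above concrete, here is a rough standalone sketch of a tagged-text reader. It is not biblioflow's parser: parse_wos, REPEATED_TAGS, and the choice of which tags count as repeated are assumptions made for illustration. It relies on the Web of Science export layout, where a field starts with a two-character tag and continuation lines are indented.

```python
# Tags whose continuation lines are separate items rather than wrapped text.
# This set is an assumption for the sketch, not an exhaustive list.
REPEATED_TAGS = {"AU", "AF", "C1", "CR"}


def parse_wos(text: str) -> list[dict]:
    """Split a tagged plain-text export into one dict of raw tags per record."""
    records, fields, tag = [], {}, None
    for line in text.lstrip("\ufeff").splitlines():
        if not line.strip():
            continue
        head = line[:2]
        if head == "ER":  # end of record
            records.append(
                {t: (v if t in REPEATED_TAGS else " ".join(v)) for t, v in fields.items()}
            )
            fields, tag = {}, None
        elif head in ("FN", "VR", "EF"):  # file header and end-of-file markers
            tag = None
        elif head.strip():  # a new tag starts a new field
            tag = head
            fields.setdefault(tag, []).append(line[3:].strip())
        elif tag is not None:  # indented continuation of the current field
            fields[tag].append(line.strip())
    return records


sample = """FN Clarivate Analytics Web of Science
VR 1.0
PT J
AU Smith, J
   Lee, K
TI A long title that wraps
   onto a second line
PY 2021
ER

PT J
AU Garcia, M
TI Second record
PY 2019
ER

EF
"""
records = parse_wos(sample)
print(records[0]["AU"])  # ['Smith, J', 'Lee, K']
print(records[0]["TI"])  # A long title that wraps onto a second line
```

The sketch stops at raw tags; mapping them to canonical fields such as authors or source_title would be a separate step.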

Scopus CSV

Scopus CSV exports are loaded through the same API:

dataset = bf.load("scopus.csv")

Scopus-specific columns such as Authors, Author full names, Title, Year, Source title, Cited by, DOI, Affiliations, References, and EID are mapped to canonical fields. Files that begin with a UTF-8 byte-order mark (BOM) are supported.
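
As an illustration of header mapping and BOM handling, the sketch below reads a small Scopus-style CSV with the standard library only. It does not show biblioflow's actual mapping: SCOPUS_COLUMNS, read_scopus_csv, and the pairing of each column with a canonical field (for example EID with source_id) are assumptions.

```python
import csv
import io

# Hypothetical column-to-field pairing; the real mapping may differ.
SCOPUS_COLUMNS = {
    "Authors": "authors_raw",
    "Title": "title",
    "Year": "year",
    "Source title": "source_title",
    "Cited by": "cited_by_count",
    "DOI": "doi",
    "EID": "source_id",
}


def read_scopus_csv(data: bytes) -> list[dict[str, str]]:
    """Decode CSV bytes and rename known Scopus columns to canonical names."""
    # "utf-8-sig" drops a leading byte-order mark, so the first header
    # is read as "Authors" rather than "\ufeffAuthors".
    text = data.decode("utf-8-sig")
    reader = csv.DictReader(io.StringIO(text))
    return [
        {SCOPUS_COLUMNS[col]: value for col, value in row.items() if col in SCOPUS_COLUMNS}
        for row in reader
    ]


sample = (
    "\ufeffAuthors,Title,Year,Cited by,DOI,EID\n"
    '"Smith J.; Lee K.",An example title,2021,12,10.1000/example,2-s2.0-0000000000\n'
).encode("utf-8")

rows = read_scopus_csv(sample)
print(rows[0]["authors_raw"])  # Smith J.; Lee K.
print(rows[0]["cited_by_count"])  # 12
```

Values stay as strings in this sketch; converting year or cited_by_count to integers would be a further normalization step.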

OpenAlex and Crossref JSON

Saved API responses can be loaded as files:

openalex = bf.load("openalex.json", source="openalex")
crossref = bf.load("crossref.json", source="crossref")

OpenAlex JSON supports a single work, a list of works, or a response object with results. OpenAlex abstract inverted indexes are reconstructed into plain text.
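
The two OpenAlex behaviours described above can be sketched in a few lines. This is an illustrative standalone version, not biblioflow's code: unwrap_openalex and rebuild_abstract are hypothetical helper names. The abstract_inverted_index field is the OpenAlex work attribute that maps each word to the positions where it occurs.

```python
def unwrap_openalex(payload):
    """Return a list of works from a single work, a list, or a results object."""
    if isinstance(payload, list):
        return payload
    if isinstance(payload, dict) and "results" in payload:
        return payload["results"]
    return [payload]


def rebuild_abstract(inverted_index):
    """Turn a {word: [positions]} mapping back into plain text."""
    if not inverted_index:
        return ""
    by_position = {}
    for word, positions in inverted_index.items():
        for position in positions:
            by_position[position] = word
    # Emit the words in position order, separated by single spaces.
    return " ".join(by_position[i] for i in sorted(by_position))


payload = {
    "results": [
        {
            "id": "W1",
            "abstract_inverted_index": {
                "the": [0, 4],
                "map": [1],
                "is": [2],
                "not": [3],
                "territory": [5],
            },
        }
    ]
}
works = unwrap_openalex(payload)
print(rebuild_abstract(works[0]["abstract_inverted_index"]))
# the map is not the territory
```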

Crossref JSON supports a single work, a list of works, or a work-list response with message.items. Crossref date-parts are normalized into ISO-like dates.
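
A possible shape for the Crossref handling is sketched below. It is not biblioflow's implementation: unwrap_crossref and date_parts_to_iso are hypothetical names, and stopping at the first missing date part is an assumption. Crossref dates may carry only a year or a year and month, which is why the result is ISO-like rather than always a full date.

```python
def unwrap_crossref(payload):
    """Return a list of works from a single work, a list, or message.items."""
    if isinstance(payload, list):
        return payload
    items = payload.get("message", {}).get("items")
    if items is not None:
        return items
    return [payload]


def date_parts_to_iso(date_field):
    """Convert {"date-parts": [[2021, 3, 15]]} to "2021-03-15" (or shorter)."""
    parts = (date_field or {}).get("date-parts") or [[]]
    numbers = []
    for part in parts[0][:3]:
        if part is None:  # stop at the first missing component
            break
        numbers.append(int(part))
    if not numbers:
        return None
    year, *rest = numbers
    return "-".join([f"{year:04d}"] + [f"{n:02d}" for n in rest])


payload = {"message": {"items": [{"DOI": "10.1000/example", "issued": {"date-parts": [[2021, 3]]}}]}}
work = unwrap_crossref(payload)[0]
print(date_parts_to_iso(work["issued"]))  # 2021-03
print(date_parts_to_iso({"date-parts": [[2021, 3, 15]]}))  # 2021-03-15
```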

API connectors

File loading uses bf.load(...). API imports use dedicated functions:

records = bf.from_openalex(search="bibliometrics", limit=100)
records = bf.from_crossref(query="science mapping", limit=100)
pubmed = bf.from_pubmed(
    query="bibliometrics AND reproducibility",
    limit=100,
    email="researcher@example.org",
)
pmc = bf.from_pubmed_central(
    query="open science",
    limit=50,
    email="researcher@example.org",
)

Scopus API access requires optional dependencies and local pybliometrics configuration:

records = bf.from_scopus(
    query="TITLE-ABS-KEY(bibliometrics)",
    limit=100,
)

If Scopus API support is not installed or configured, biblioflow raises a structured configuration/dependency error.

Screening before analysis

The core library returns normalized datasets. Applications such as biblioflow-web and biblioflow-nb can stage those normalized records as source-agnostic screening runs before creating an analysis dataset. Use this intermediate step when a search or upload may contain irrelevant records, duplicates, or uncertain candidates:

import biblioflow_nb as bfn

app = bfn.app(display=False)
run = app.stage_file("records.ris", source="auto", format="auto")
app.update_candidates([run["candidates"][0]["candidate_id"]], status="selected")
app.promote_candidates()

In the web app, the Import or Load page creates screening runs from uploaded files and supported remote API sources. Candidate review happens on the dedicated Screening page. Direct bf.load(...) and bf.from_* calls remain available when no screening step is needed.