ChemDataWriter - Reader

Reader is built upon the Document class of BatteryDataExtractor.

It processes raw XML/HTML paper files and converts them into a Reader object, from which the content of the paper can be extracted and analysed in a structured way.

Features

Pre-process HTML/XML files from three publishers: RSC, Elsevier, and Springer

Extract metadata: title, author, date, journal, issue, abstract, ...

Support user-defined paper files in the JSON format

Enable large-scale scientific paper retrieval and analysis
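
As a rough illustration of the metadata step, the sketch below pulls a few fields out of a toy XML string and serialises them to JSON using only the Python standard library. It is not the Reader API: the tag names, the `extract_metadata` helper and the JSON keys are all assumptions made for this example, and real publisher files follow their own schemas.

```python
import json
import xml.etree.ElementTree as ET

# Toy XML with made-up tag names; real RSC/Elsevier/Springer files differ.
RAW_XML = """<article>
  <title>Solid electrolytes for lithium batteries</title>
  <authors><author>A. Researcher</author><author>B. Scientist</author></authors>
  <date>2021-05-04</date>
  <journal>Example Journal</journal>
  <abstract>We review solid electrolytes.</abstract>
</article>"""


def extract_metadata(xml_text):
    """Return a dict of metadata fields found in the (hypothetical) XML layout."""
    root = ET.fromstring(xml_text)

    def text_of(tag):
        # Return the stripped text of the first matching child, or None.
        node = root.find(tag)
        return node.text.strip() if node is not None and node.text else None

    return {
        "title": text_of("title"),
        "authors": [a.text.strip() for a in root.iter("author") if a.text],
        "date": text_of("date"),
        "journal": text_of("journal"),
        "abstract": text_of("abstract"),
    }


metadata = extract_metadata(RAW_XML)
# A JSON representation such as this could stand in for a user-defined paper file.
paper_json = json.dumps(metadata, indent=2)
```

Keeping the extracted fields in a plain dictionary makes it straightforward to store many papers side by side, which is what large-scale retrieval and analysis need.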

ChemDataWriter - Finder

Finder is designed to suggest key topics in a corpus of scientific research papers.

Finder is built upon the BERTopic library, which leverages transformers to identify and classify the key themes and topics within a corpus of documents.

Features

Identify relevant topics within a large collection of scientific papers

Gain insights into emerging trends and topics within a particular field

Build your own models: embeddings, dimensionality reduction, clustering, tokeniser, weighting scheme, representation tuning

Visualise the topics per class or over time; calculate the probability of each topic
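
To make the "weighting scheme" component more concrete, here is a minimal standard-library sketch of a class-based TF-IDF, the kind of weighting BERTopic documents for describing topics. It assumes the documents have already been tokenised and grouped into topics (the embedding, dimensionality reduction and clustering steps are skipped), and the function names `class_tfidf` and `top_words` are invented for this example rather than taken from Finder or BERTopic.

```python
import math
from collections import Counter


def class_tfidf(docs_by_topic):
    """Score words per topic with a class-based TF-IDF.

    docs_by_topic maps a topic label to a list of tokenised documents.
    """
    # Merge each topic's documents into a single bag of words.
    topic_counts = {
        topic: Counter(tok for doc in docs for tok in doc)
        for topic, docs in docs_by_topic.items()
    }
    # Frequency of each word across all topics.
    total_counts = Counter()
    for counts in topic_counts.values():
        total_counts.update(counts)
    # Average number of words per topic.
    avg_words = sum(total_counts.values()) / len(topic_counts)

    scores = {}
    for topic, counts in topic_counts.items():
        n_words = sum(counts.values())
        scores[topic] = {
            # Normalised term frequency times a smoothed inverse class frequency.
            word: (freq / n_words) * math.log(1 + avg_words / total_counts[word])
            for word, freq in counts.items()
        }
    return scores


def top_words(scores, topic, k=3):
    """Return the k highest-scoring words for a topic (ties broken alphabetically)."""
    ranked = sorted(scores[topic].items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:k]]


docs_by_topic = {
    "cathode": [["lithium", "cathode", "capacity"], ["cathode", "coating", "lithium"]],
    "electrolyte": [["lithium", "electrolyte", "solvent"], ["electrolyte", "salt", "lithium"]],
}
scores = class_tfidf(docs_by_topic)
```

Words shared by every topic, such as "lithium" in this toy corpus, receive a lower weight than words concentrated in one topic, which is why they tend not to appear among a topic's top keywords.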

ChemDataWriter - Retriever

Retriever retrieves the text most relevant to a given topic from a corpus of scientific papers.

Built upon the Haystack library, Retriever uses transformer-based models to search, query, and re-rank large volumes of text from scientific documents.

Features

Rank the text of a corpus against a given topic according to term frequency, word order and syntax

Perform document retrieval by sweeping through text saved in a DocumentStore

Select various retrieval methods: BM25, DensePassage, TableText, Embedding, Tfidf, MultiModal, Web

Define the number of documents to retrieve, and re-rank them by importance or date
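
The retrieve-then-re-rank idea can be sketched in plain Python. The example below scores documents with a common form of BM25, keeps the top `top_k` hits, and optionally re-orders them by date. It does not use Haystack or the Retriever API: `bm25_scores`, `retrieve`, the `by` parameter and the `"text"`/`"date"` keys are assumptions made for illustration, and a simple list stands in for a DocumentStore.

```python
import math
from collections import Counter
from datetime import date


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Return one BM25 score per tokenised document for a tokenised query."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Smoothed, non-negative inverse document frequency.
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            # Term frequency saturated by k1 and normalised by document length.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def retrieve(query, corpus, top_k=2, by="score"):
    """Return up to top_k matching papers, ordered by BM25 score or by date."""
    tokenised = [paper["text"].lower().split() for paper in corpus]
    scores = bm25_scores(query.lower().split(), tokenised)
    ranked = sorted(zip(scores, corpus), key=lambda pair: -pair[0])
    # Keep the best top_k papers and drop those with no query term at all.
    hits = [paper for score, paper in ranked[:top_k] if score > 0]
    if by == "date":
        hits.sort(key=lambda paper: paper["date"], reverse=True)
    return hits


# Toy corpus; the texts and dates are invented for this example.
corpus = [
    {"text": "lithium ion cathode coating improves capacity", "date": date(2019, 3, 1)},
    {"text": "solid electrolyte for lithium metal batteries", "date": date(2022, 7, 15)},
    {"text": "perovskite solar cell efficiency", "date": date(2021, 1, 10)},
]
hits_by_score = retrieve("lithium cathode", corpus, top_k=2, by="score")
hits_by_date = retrieve("lithium cathode", corpus, top_k=2, by="date")
```

Separating the two stages means the cut-off (`top_k`) is always decided by relevance, while the final ordering shown to the user can follow either relevance or recency.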