Reader is built upon the Document class of BatteryDataExtractor.
It processes raw XML/HTML paper files and converts each one into a Reader object, from which the content of the paper can be extracted and analysed in a structured way.
Pre-process HTML/XML files from three publishers: RSC, Elsevier, and Springer
Extract metadata: title, author, date, journal, issue, abstract, ...
Support user-defined paper files in the JSON format
Enable large-scale scientific paper retrieval and analysis
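To illustrate the user-defined JSON input mentioned above, the following is a minimal sketch of loading one paper record. The schema that Reader actually expects is not specified in this section, so the `PaperRecord` class, the `load_paper` helper, and the key names (which simply mirror the metadata fields listed above) are all assumptions made for illustration.

```python
import json
from dataclasses import dataclass, field, fields


@dataclass
class PaperRecord:
    """Hypothetical container for one paper; NOT the library's Reader class."""
    title: str = ""
    author: list = field(default_factory=list)
    date: str = ""
    journal: str = ""
    issue: str = ""
    abstract: str = ""


def load_paper(raw_json: str) -> PaperRecord:
    """Parse a JSON string and keep only the keys this sketch knows about."""
    data = json.loads(raw_json)
    known = {f.name for f in fields(PaperRecord)}
    return PaperRecord(**{k: v for k, v in data.items() if k in known})


# Placeholder record (assumed key names, made-up values for illustration only).
raw = json.dumps({
    "title": "Example paper title",
    "author": ["A. Author", "B. Author"],
    "journal": "Example Journal",
    "abstract": "Example abstract text.",
    "unused_key": "ignored by this sketch",
})

paper = load_paper(raw)
print(paper.title, paper.author)
```

Missing fields fall back to empty defaults, and unknown keys are dropped, so partially filled records can still be loaded.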
Finder is designed to suggest key topics in a corpus of scientific research papers.
Finder is built upon the BERTopic library, which leverages transformers to identify and classify the key themes and topics within a corpus of documents.
Identify relevant topics within a large collection of scientific papers
Gain insights into emerging trends and topics within a particular field
Build your own models: embeddings, dimensionality reduction, clustering, tokeniser, weighting scheme, representation tuning
Visualise the topics per class or over time; calculate the probability of each topic
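The weighting scheme is one of the components listed above. BERTopic's default scheme is class-based TF-IDF (c-TF-IDF), which scores a term by its frequency within a topic and down-weights terms that are common across all topics. The sketch below reimplements that idea in plain Python to show the arithmetic; it is not the BERTopic or Finder code, and the `c_tf_idf` and `top_terms` helpers and the toy token lists are assumptions for illustration.

```python
import math
from collections import Counter


def c_tf_idf(classes):
    """Compute c-TF-IDF-style weights.

    classes: dict mapping a topic label to the list of tokens from all
    documents assigned to that topic (hypothetical input format).
    """
    counts = {label: Counter(tokens) for label, tokens in classes.items()}

    # Frequency of each term summed over every topic.
    total_per_term = Counter()
    for term_counts in counts.values():
        total_per_term.update(term_counts)

    # Average number of tokens per topic.
    avg_words = sum(len(tokens) for tokens in classes.values()) / len(classes)

    weights = {}
    for label, term_counts in counts.items():
        n_tokens = sum(term_counts.values())
        weights[label] = {
            term: (freq / n_tokens) * math.log(1 + avg_words / total_per_term[term])
            for term, freq in term_counts.items()
        }
    return weights


def top_terms(weights, label, k=3):
    """Return the k highest-weighted terms of one topic (ties broken alphabetically)."""
    ranked = sorted(weights[label].items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]


# Toy clusters, as if documents had already been grouped into two topics.
clusters = {
    "A": "lithium anode lithium dendrite electrolyte".split(),
    "B": "cathode nickel cathode cobalt electrolyte".split(),
}
weights = c_tf_idf(clusters)
print(top_terms(weights, "A", k=1), top_terms(weights, "B", k=1))
```

In this toy input, "electrolyte" appears in both topics and therefore receives a lower weight than terms that occur in only one topic.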
Retriever performs text retrieval over a corpus of scientific papers for a given topic.
Built upon the Haystack library, Retriever uses transformer-based models to search, query, and re-rank large volumes of text from scientific documents.
Rank the text in a corpus against a given topic by frequency, word order, and syntax
Perform document retrieval by sweeping through text saved in a DocumentStore
Select various retrieval methods: BM25, DensePassage, TableText, Embedding, Tfidf, MultiModal, Web
Define the number of documents to retrieve, and re-rank them by importance or date
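BM25 is the first retrieval method listed above. As a rough sketch of what frequency-based ranking with a fixed number of returned documents looks like, the snippet below implements Okapi BM25 scoring in plain Python. It is not Haystack's implementation and does not use a DocumentStore; the `bm25_rank` function, its `top_k`, `k1`, and `b` parameters, whitespace tokenisation, and the toy documents are all assumptions for illustration.

```python
import math
from collections import Counter


def bm25_rank(query, documents, top_k=2, k1=1.5, b=0.75):
    """Score each document against the query with Okapi BM25 and return the top_k."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs

    # Number of documents that contain each term.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scored = []
    for idx, tokens in enumerate(tokenized):
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            # Length normalisation: longer-than-average documents are penalised.
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scored.append((score, idx))

    # Highest score first; original order breaks ties.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(documents[idx], score) for score, idx in scored[:top_k]]


docs = [
    "solid electrolyte interphase forms on the lithium anode",
    "nickel rich cathode materials degrade at high voltage",
    "the electrolyte additive stabilises the solid electrolyte interphase",
]
results = bm25_rank("solid electrolyte interphase", docs, top_k=2)
for text, score in results:
    print(f"{score:.3f}  {text}")
```

BM25 considers only term frequency and document length. The dense and embedding-based methods in the list instead compare learned vector representations, which is where the transformer models come in.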