LMS-DB-ETL
An Extract, Transform, Load (ETL) app to gather book information from public APIs for a Proof of Concept Library Management System project.
(Past Git history can be found at: https://github.com/Kalarsoft/LMS-DB-ETL and https://gitea.com/NickKalar/LMS-DB-ETL)
Problem
Currently, I am working on building a Library Management System (LMS) to help develop and showcase my software engineering skills. To fully test and run the LMS, I need a database populated with a variety of media. As I am one person with only about 300 books to my name, I needed a better solution than manually entering those books.
Solution
This project seeks to seed a database with book details, mostly pulled from public APIs. The current version uses the Google Books API and Open Library API. After pulling data from these APIs for several books, the data is merged and transformed to be loaded into a PostgreSQL database for consumption by the RESTful APIs associated with the LMS project.
This is a rudimentary ETL pipeline: it uses no external tools and relies on only two Python libraries for making the API calls and connecting to the database. However, it does showcase my understanding of Data Engineering and the ETL cycle.
Setup
Environment Variables:
GOOGLE_API_KEY - API Key required for using the Google Books API.
DB_NAME - The name of the SQL database being used.
DB_USER - The authorized user for the database.
DB_PASSWORD - The password to access the database.
LOG_FILE - The file location for logs to be saved to.
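As a rough illustration of how these variables might be read at startup, the sketch below checks that all five are set before the pipeline does any work. The function name `read_settings` and the fail-fast behaviour are assumptions made for this example, not a description of the project's actual code.

```python
import os

# The five variables listed above.
REQUIRED_VARS = ("GOOGLE_API_KEY", "DB_NAME", "DB_USER", "DB_PASSWORD", "LOG_FILE")


def read_settings(environ=os.environ):
    """Return the required settings as a dict, or raise if any are missing.

    `read_settings` is a hypothetical helper, not part of the project.
    """
    missing = [name for name in REQUIRED_VARS if not environ.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: environ[name] for name in REQUIRED_VARS}
```

Passing the environment in as an argument keeps the check easy to exercise with a plain dict instead of the real process environment.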
extract.py
The extract.py file contains functions that pull book data from different APIs. Currently, this project uses the Google Books and Open Library APIs; only the former requires an API key.
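To make the extract step more concrete, here is a minimal sketch of how a title lookup against each API could be built. It uses only the standard library (`urllib`) so that it runs without extra packages; the project itself may use a different HTTP library. The helper names are hypothetical, and searching by title with `intitle:` on Google Books and `title=` on Open Library is an assumption about how the lookups are done.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

GOOGLE_BOOKS_URL = "https://www.googleapis.com/books/v1/volumes"
OPEN_LIBRARY_URL = "https://openlibrary.org/search.json"


def google_books_url(title, api_key):
    """Build a Google Books volume search URL for a title (needs an API key)."""
    query = urlencode({"q": f"intitle:{title}", "key": api_key})
    return f"{GOOGLE_BOOKS_URL}?{query}"


def open_library_url(title):
    """Build an Open Library search URL for a title (no API key needed)."""
    return f"{OPEN_LIBRARY_URL}?{urlencode({'title': title})}"


def fetch_json(url, timeout=10):
    """Perform the GET request and decode the JSON response body."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)
```

Keeping URL construction separate from the network call means the query logic can be checked without hitting either API.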
transform.py
Takes the raw JSON stored by extract.py and merges the entries for each book into a single entry whose keys
match the column names of the database schema.
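The merge described above could look roughly like the sketch below. The input field names (`title`, `authors`, `publisher`, `pageCount` from a Google Books `volumeInfo` object; `title`, `author_name`, `first_publish_year` from an Open Library search result) come from those APIs' responses. The output keys are hypothetical column names chosen for illustration, since the real schema is not shown here.

```python
def merge_book(google_info, open_library_doc):
    """Merge one Google Books volumeInfo dict and one Open Library search
    doc into a single flat record.

    The output keys are hypothetical column names, not the real schema.
    """
    # Prefer Google Books for authors, falling back to Open Library.
    authors = google_info.get("authors") or open_library_doc.get("author_name") or []
    return {
        "title": google_info.get("title") or open_library_doc.get("title"),
        "author": ", ".join(authors),
        "publisher": google_info.get("publisher"),
        "published_year": open_library_doc.get("first_publish_year"),
        "page_count": google_info.get("pageCount"),
    }
```

Fields missing from both sources come back as `None` (or an empty string for `author`), which leaves the decision about NULLs and defaults to the load step.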
load.py
Takes the JSON file created by transform.py and loads the data into a PostgreSQL database for
retrieval later.
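One way the load step might turn a transformed record into SQL is sketched below. It builds a parameterized INSERT using `%s` placeholders, the style used by PostgreSQL drivers such as psycopg2 (an assumption; the driver is not named here). The table name `books` and the helper `build_insert` are hypothetical.

```python
def build_insert(table, row):
    """Return (sql, params) for inserting one record.

    Table and column names are interpolated into the statement, so they
    must come from trusted code, never from API data. The values are
    passed separately as parameters so the driver can escape them.
    """
    columns = list(row)
    placeholders = ", ".join(["%s"] * len(columns))
    sql = f"INSERT INTO {table} ({', '.join(columns)}) VALUES ({placeholders})"
    return sql, [row[column] for column in columns]
```

With a DB-API cursor, the result would typically be used as `cursor.execute(sql, params)`, followed by a commit on the connection.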
orchestrator.py
Runs each program one after the other, ensuring each
finishes with no fatal errors before moving on to the next. Also cleans up the files created
by the programs before ending.
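The run-in-order, stop-on-failure, then-clean-up behaviour can be sketched as follows. This version launches each step as a subprocess and treats a non-zero exit code as a fatal error; whether the real orchestrator uses subprocesses or direct function calls is an assumption, and `run_pipeline` is a hypothetical name.

```python
import subprocess
import sys
from pathlib import Path


def run_pipeline(steps, temp_files=()):
    """Run each step with the current Python interpreter, in order.

    `steps` is a list of argument lists, e.g. [["src/extract.py"], ...].
    Stops at the first step that exits with a non-zero code. Files listed
    in `temp_files` are removed whether or not the pipeline succeeds.
    """
    try:
        for step in steps:
            result = subprocess.run([sys.executable, *step])
            if result.returncode != 0:
                return False
        return True
    finally:
        for path in temp_files:
            Path(path).unlink(missing_ok=True)  # requires Python 3.8+
```

A call might look like `run_pipeline([["src/extract.py"], ["src/transform.py"], ["src/load.py"]], temp_files=["raw.json"])`, where the script paths and the file name are placeholders.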
config/title.txt
A file with a list of book titles. Titles do not need to be in order; however, each title needs to be on its own line, and any special characters should be escaped.
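For illustration, a file in this one-title-per-line format could be read as shown below. Skipping blank lines and trimming surrounding whitespace are assumptions about what the pipeline tolerates, and both helper names are hypothetical.

```python
from pathlib import Path


def parse_titles(text):
    """Split file contents into titles, one per line, ignoring blank lines."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def read_titles(path="config/title.txt"):
    """Read the title list from disk."""
    return parse_titles(Path(path).read_text(encoding="utf-8"))
```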
How To Use
- Create a virtual environment (optional, but best practice)
python3 -m venv ./.venv
source ./.venv/bin/activate
- Use pip to install all required packages
pip install -r requirements.txt
- Run the Orchestrator:
python src/orchestrator.py
OR
python3 src/orchestrator.py