Find Publicly Available Institution with Representative Data, Collection…
- From each .txt stored per collection in a folder, combine into a single .txt file to be turned into an accessible database in CSV format.
Loop through the input folder, read each file, copy the information stored within, and write it to a new combined.txt file, keeping the spacing intact so that information from different files is not merged.
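The combining step described above can be sketched as follows; the folder name `input_funds` and the output name `combined.txt` are assumptions for illustration.

```python
from pathlib import Path

# Hypothetical folder holding one .txt file per collection;
# the names "input_funds" and "combined.txt" are assumptions.
input_dir = Path("input_funds")
input_dir.mkdir(exist_ok=True)
output_file = Path("combined.txt")

with output_file.open("w", encoding="utf-8") as out:
    for txt_path in sorted(input_dir.glob("*.txt")):
        out.write(txt_path.read_text(encoding="utf-8"))
        # Blank line between files so records from different
        # collections do not merge into one another
        out.write("\n\n")
```

Sorting the paths keeps the combined file in a stable, reproducible order.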
Considering the output of the final .txt file, a system for extracting and categorizing the text into CSV format was attempted.
Some results were produced, but because the outputs were inconsistent and error-prone, this approach was abandoned.
(CITE ARTICLE KOHA) Inspired by this methodology, but reversed so as to convert from MARC to CSV, the PyMarc library and the third-party application MarcEdit were considered.
As a result, attempts were made to standardize the MARC data to fit these applications.
- For this section the third-party application MarcEdit is used to perform multiple conversions until a .csv dataframe is created.
First, using the tools available in the program, the .txt file's record structure is validated to see whether any records were incorrectly formatted. Errors appear that do not have a significant impact; most importantly, the record structure is recognized.
With the record structure recognized, the .txt file is taken as input and converted into the human-readable mnemonic format.
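Once records are in MarcEdit's human-readable mnemonic (.mrk) style, flattening them into CSV rows can be sketched with the standard library alone; the sample field contents below are invented for illustration.

```python
import csv
import io

# A tiny sample record in MarcEdit's mnemonic (.mrk) style; the
# field contents here are invented for illustration only.
mrk_text = (
    "=LDR  00000nam a2200000 a 4500\n"
    "=100  1\\$aPavarde, Vardas\n"
    "=245  10$aRankrascio antraste\n"
    "=260  \\\\$c1898\n"
)

def mrk_to_rows(text):
    """Turn one mnemonic record into flat (tag, value) rows."""
    rows = []
    for line in text.splitlines():
        if not line.startswith("=") or line.startswith("=LDR"):
            continue
        tag = line[1:4]
        # Skip "=TAG  II" (tag, two spaces, two indicators), then
        # drop subfield markers like $a for a flat CSV value
        value = line[8:]
        for marker in ("$a", "$b", "$c"):
            value = value.replace(marker, " ")
        rows.append((tag, value.strip()))
    return rows

buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["tag", "value"])
writer.writerows(mrk_to_rows(mrk_text))
```

A fuller version would keep subfields as separate columns, but the flat (tag, value) shape is enough to load the data into a spreadsheet or dataframe.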
- After the collection funds are collected, we move on to the virtual library catalogue, looking specifically at the "manuscript" section.
Using the structure from the fund collecting done before, the fund identifier is used as the search input for the virtual library; this is done starting from F1 through F451.
To narrow the search down to individual collection items, the "Bibliotekos fonde" filter (roughly translated as "in the library's physical collection") is applied.
To get an overview of how many collection items we will be dealing with, and to build a hierarchy of which collections will take the most time, the total number of entries in each collection is recorded from the item count shown at the top of a search.
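Reading the total count out of the results header can be sketched as below; the exact Lithuanian wording of the header is an assumption for illustration.

```python
import re

# Hypothetical results-header text from the virtual library; the
# exact wording ("... iš N ...") is an assumption for illustration.
header = "Rodoma 1 - 20 iš 186 įrašų"

def total_entries(text):
    """Pull the total item count out of a results header."""
    match = re.search(r"iš\s+(\d+)", text)
    return int(match.group(1)) if match else 0
```

Returning 0 when the pattern is absent lets the scan skip empty or oddly formatted funds without crashing.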
From the insight gathered in the full scan of collection items, all of the information connected to each item could now be collected. (STEP BY STEP EXPLANATION)
Initializing a script-controllable webdriver with the Selenium Python library, specifically using the Firefox GeckoDriver for my chosen browser.
The webdriver opens the website at a link pointing to the virtual library's manuscript-section search bar.
A series of counters is defined to keep track of which collection is being collected, to increment the fund number in the search bar, and to change pages appropriately on larger collections.
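The counter logic described above can be sketched independently of the browser automation; the fund numbering F1..F451 comes from the inventory overview, while the page size is an assumption.

```python
import math

TOTAL_FUNDS = 451       # funds F1 .. F451, from the inventory overview
ITEMS_PER_PAGE = 20     # assumed page size of the search results

def fund_query(fund_number):
    """Build the search-bar input for one fund, e.g. 1 -> 'F1'."""
    return f"F{fund_number}"

def pages_needed(total_items):
    """How many result pages a collection of this size spans."""
    return max(1, math.ceil(total_items / ITEMS_PER_PAGE))

# Every fund identifier, in the order the scraper would visit them
queries = [fund_query(n) for n in range(1, TOTAL_FUNDS + 1)]
```

In the real program these values would drive the Selenium loop: `fund_query` fills the search bar and `pages_needed` decides how many times to click the next-page control.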
1. First, the fund inventory overview is examined; there are 451 registered funds in total.
The data is contained within clickable pages; each fund name starts with the letter F followed by a number. This is noted, as the same structure will be used when searching for individual items in each collection.
To avoid manually clicking through each fund, the provided list of fund names in PDF format is copied and used with a webdriver from the Selenium Python library to automate a program that clicks through each fund page. As it goes, the program takes each page's metadata and stores it in a .txt file for further analysis.
Of particular interest are the item count, the time period of the manuscripts contained within a single fund, and when the fund was received into the library.
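The step of storing each fund page's metadata in a .txt file might look like this; the metadata values and filename below are hypothetical, and in the real program they would be scraped via the Selenium webdriver.

```python
# Hypothetical per-fund metadata; in the real program these values
# would be scraped from each fund page via the Selenium webdriver.
funds = [
    {"fund": "F1", "items": 312, "period": "1503-1918", "received": "1957"},
    {"fund": "F2", "items": 87, "period": "1795-1940", "received": "1962"},
]

with open("funds_metadata.txt", "w", encoding="utf-8") as out:
    for meta in funds:
        # One tab-separated line per fund for easy later parsing
        fields = (meta["fund"], str(meta["items"]), meta["period"], meta["received"])
        out.write("\t".join(fields) + "\n")
```

A tab-separated layout keeps the file human-readable while remaining trivial to split back into columns for the later CSV step.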
TOTAL COLLECTED ENDED UP BEING 86,339. !! COMPARE WITH END RESULT !!