Project Documentation
OCTOBER 2020
- Started researching scripts
- Researched OCR programs. Uploaded PDFs of scripts to test whether the OCR would work
NOVEMBER 2020
- Continued researching scripts. Started OCR using Adobe Acrobat Pro
- Experimented with the text files in Oxygen to see whether the program could read the files accurately.
JANUARY 2021
- Week of 1/17-1/23
- Narrowed down research topic. Decided what I want to do for DIGIT 494 and what I would like to continue in the following years.
- Documented my sources, what the project will entail, an About Me section, etc.
- Week of 1/24-1/30
- Developed a splash page previewing what comes next in my planning
- Created a rough About Me page
- Continued converting PDFs to text files
- Hand-edited text files that had minor word-merging and spelling issues caused by OCR of older scanned PDF documents
FEBRUARY 2021
- Week of 1/31-2/6
- Continued converting PDFs to text files
- Hand-edited text files that had minor word-merging and spelling issues caused by OCR of older scanned PDF documents
- Had trouble converting Goblet of Fire (huge watermark; reach out to Matthew Ciszek in the PSB library for help) and Order of the Phoenix (was a .fdr file rather than a .pdf; saved it in .rtf format through Oxygen?)
- Began work on the schema to define rules for the upcoming markup
- What kinds of word analysis would be interesting to look at? Distant reading!!!
- Do the screenplays reflect femininity towards the latter half and lower?
- J.K. Rowling Accused of Transphobia
- Gender Analysis as a potential keyhole for project
- Look up articles in PSU libraries
- Eberhardt, Maeve. “Gendered Representations through Speech: The Case of the Harry Potter Series”
- Week of 2/7-2/13
- Decided which elements and attributes to use so they fit well with the script
- Developed first draft of the project schema
- Researched sex and gender issues between Rowling and the community
- Began brainstorming specific research questions for project
- Found Goblet of Fire on genius.com (needed to find a new script because the original one had a watermark across every page)
- Decided to incorporate the lack of script sources into my project analysis
- Week of 2/14-2/20
- Updated and revised schema so it would work with future XML markup
- Created the “cat” attribute, which records whether a scene occurs inside or outside (in addition to the location of the scene)
- Ran into issues with how to capture various groups of text, since there were strong inconsistencies in the order of text types (speech, stage, etc.)
- Refreshed my memory on RegEx to prepare for the next step of the project
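As a rough illustration of the “cat” attribute described above, the Python sketch below derives an inside/outside category from a scene heading. It assumes headings follow the common screenplay convention of starting with INT. or EXT. The element name `scene`, the attribute name `loc`, and the values `inside` and `outside` are hypothetical stand-ins, since the actual schema names are not recorded in this log.

```python
import re
from xml.sax.saxutils import quoteattr

# Assumed heading shape: "INT. GREAT HALL - NIGHT" or "EXT. HOGWARTS - DAY".
HEADING = re.compile(r"^(INT|EXT)\.\s+(.+?)\s*$")


def tag_scene_heading(line):
    """Return an opening <scene> tag for a heading line, or None if no match."""
    match = HEADING.match(line.strip())
    if match is None:
        return None
    # "cat" holds inside/outside; "loc" (hypothetical) holds the location text.
    cat = "inside" if match.group(1) == "INT" else "outside"
    location = match.group(2)
    return f"<scene cat={quoteattr(cat)} loc={quoteattr(location)}>"


if __name__ == "__main__":
    print(tag_scene_heading("INT. GREAT HALL - NIGHT"))
```

Headings that do not start with INT. or EXT. return None, so they can be set aside for hand editing.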
- Week of 2/21-2/27
- Began discussing the best way to apply RegEx to the scripts
- Issues with the scripts being inconsistent in both the setup and character names.
- Some of the inconsistencies will be taken care of by hand, but most of it will be fixed with RegEx
- Three scripts are completely marked up, and two more will be complete by next week.
- The main goal is to build a strong foundation before text analysis takes place.
- I plan to run RegEx on The Half-Blood Prince and review The Goblet of Fire, The Order of the Phoenix, and Deathly Hallows Pt. 2
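The kind of RegEx pass discussed this week can be sketched in Python as follows. This is only an illustration under stated assumptions: it assumes each speech is an all-caps speaker cue on its own line followed by dialogue lines and a blank line, and the `sp` element and `who` attribute are hypothetical names, not the project's confirmed schema. Name normalization stands in for the character-name inconsistencies noted above.

```python
import re
from xml.sax.saxutils import escape, quoteattr

# Assumed layout: an all-caps speaker cue on its own line, then one or more
# non-empty dialogue lines, ended by a blank line or the end of the file.
SPEECH = re.compile(
    r"^(?P<who>[A-Z][A-Z .'-]+)\n(?P<text>(?:.+\n?)+)",
    re.MULTILINE,
)


def normalize_name(raw):
    """Collapse spacing and case so 'HARRY ' and 'HARRY' give the same name."""
    return re.sub(r"\s+", " ", raw).strip().title()


def mark_up_speeches(script):
    """Wrap each cue-plus-dialogue block in a (hypothetical) <sp> element."""
    def wrap(match):
        who = normalize_name(match.group("who"))
        # Join wrapped dialogue lines into a single line of text.
        text = " ".join(match.group("text").split())
        return f"<sp who={quoteattr(who)}>{escape(text)}</sp>\n"

    return SPEECH.sub(wrap, script)


if __name__ == "__main__":
    sample = "HARRY \nI will be\nin the library.\n\nHERMIONE\nFine.\n"
    print(mark_up_speeches(sample))
```

Any all-caps line followed directly by text would be treated as a speaker cue, so scene headings and other capitalized lines would need to be tagged or separated first. That is one reason some inconsistencies still have to be handled by hand.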
MARCH 2021
- Week of 2/28-3/6
- Refined The Half-Blood Prince using RegEx
- Reviewed The Order of the Phoenix and The Goblet of Fire for errors; none were found
- I watched a portion of The Goblet of Fire to determine which script (Genius vs. Script Slug) more accurately matched the final version of the film.
- Week of 3/7-3/13
- Fixed various tagging errors within The Prisoner of Azkaban and The Half-Blood Prince.
- A large portion of time was spent cleaning the text file for The Sorcerer’s Stone.
- Because I had to clean certain text files before applying RegEx, I was able to make a range of features, such as spacing, consistent so that future RegEx work would be much easier.
- Week of 3/14-3/20
- Finally finished cleaning The Sorcerer’s Stone text file this week, which was no small accomplishment. This was the second-worst text file produced by the OCR.
- When cleaning The Sorcerer’s Stone, I had to fix numerous readability issues, such as missing spaces between words, random characters, and unnecessary extra white space between lines. Additionally, a few entire pages were lost between the OCR of the PDF file and the conversion to a text file.
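Some of the spacing cleanup described above can be automated. The Python sketch below is a minimal example, not the procedure actually used: it normalizes line endings, removes stray control characters, collapses repeated spaces, and reduces runs of blank lines. It cannot repair merged words or missing pages, which still require hand editing against the PDF.

```python
import re


def clean_ocr_text(raw):
    """Normalize spacing in OCR output before any RegEx markup is applied."""
    # Use a single newline convention.
    text = raw.replace("\r\n", "\n").replace("\r", "\n")
    # Remove stray control characters (form feeds, etc.), keeping tab and newline.
    text = re.sub(r"[\x00-\x08\x0b\x0c\x0e-\x1f]", "", text)
    # Collapse runs of spaces and tabs into one space.
    text = re.sub(r"[ \t]+", " ", text)
    # Trim leading and trailing spaces on every line.
    text = "\n".join(line.strip() for line in text.split("\n"))
    # Allow at most one blank line between blocks of text.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip() + "\n"


if __name__ == "__main__":
    print(clean_ocr_text("HARRY  \r\n\r\n\r\n\r\nHello   there\x0c \n"))
```

Running every script through the same function before markup keeps spacing consistent across the corpus, which makes later patterns simpler to write.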
- Week of 3/21-3/27
- Began applying RegEx to The Sorcerer’s Stone.
- Refreshed my memory on XPath to prepare for upcoming stages of the project.
APRIL 2021
- Week of 3/28-4/3
- Updated and revised the RegEx for The Sorcerer’s Stone and The Half-Blood Prince.
- The scripts were throwing validation errors, which required tedious editing of the XML tags.
- Began filling in header information for each script.
- Week of 4/4-4/10
- Attended an XQuery workshop with Dr. B, where we discovered various errors by searching for a master list of names across the corpus.
- Added a new and improved menu for the website that is easier to read and navigate.
- Hover colors and menu colors were changed to increase readability and reflect the two common colors represented in the series.
- Dropdown menu was created to further organize the website and make the information easier to find and access.
- Week of 4/11-4/17
- Fully developed XQuery to pull all speeches from the corpus and to count speeches across the scripts overall and within each individual script.
- Character speech scrapers were constructed in eXist (Malfoys, Weasleys, Harry, Hermione, Filch & Figg) to ensure that all “social” classes in the films are represented.
- Began text analysis on the individual character speech text files to determine whether there is a correlation between diction and social class.
- Some screenshots have been taken to support the data found in the text analysis conducted thus far.
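The speech counts and character scrapers above were written in XQuery and run in eXist. The Python sketch below only approximates the same idea with the standard library, for readers without an eXist setup. It assumes speeches are stored in un-namespaced `sp` elements with a `who` attribute; both names are hypothetical, as the log does not record the actual element names.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def count_speeches(xml_text):
    """Count speeches per speaker in one marked-up script."""
    root = ET.fromstring(xml_text)
    counts = Counter()
    for sp in root.iter("sp"):
        counts[sp.get("who", "UNKNOWN")] += 1
    return counts


def collect_speeches(xml_text, speakers):
    """Return the text of every speech by any speaker in the given set."""
    root = ET.fromstring(xml_text)
    return [
        "".join(sp.itertext()).strip()
        for sp in root.iter("sp")
        if sp.get("who") in speakers
    ]


if __name__ == "__main__":
    sample = (
        "<script>"
        '<sp who="Harry">Hello.</sp>'
        '<sp who="Filch">Students out of bed!</sp>'
        '<sp who="Harry">Run.</sp>'
        "</script>"
    )
    print(count_speeches(sample))
    print(collect_speeches(sample, {"Harry"}))
```

Summing the per-script counters gives the corpus-wide total, and the collected speeches for each group can be written to a text file for tools such as Voyant or AntConc.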
- Week of 4/18-4/24
- Continued text analysis this week to prepare for my upcoming senior project meeting.
- Found strong data for each social class, represented through screenshots of Voyant and AntConc and through embedded Voyant iframes (Analysis - Project Analysis)
- Data was then placed into Oxygen, where it could be organized into tables, lists, and paragraphs.
- Web development increased dramatically: updated the home page to include a general project description, general project stages, and future project planning. Furthermore, I added to the Processes page and added my Project Analysis information to the respective page on the site.
MAY 2021
- Week of 4/25-5/1
- Downloaded PyCharm and completed Intro to Python to refresh myself on Python’s syntax rules.
- Created a bar graph to show the top twenty adjectives used across the series
- I plan to create bar graphs for the top adjectives used by each group in my social class corpus and to show how the adjectives connect with the overall thesis.
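A top-twenty adjective graph of this kind could be produced roughly as follows. The log does not record how adjectives were identified, so this sketch assumes the tokens have already been tagged by some part-of-speech tagger using Penn Treebank tags (JJ, JJR, JJS). The function names and the output path are hypothetical, and the plotting step assumes matplotlib is installed.

```python
from collections import Counter

# Penn Treebank tags for adjectives (assumed tag set).
ADJECTIVE_TAGS = {"JJ", "JJR", "JJS"}


def top_adjectives(tagged_tokens, n=20):
    """Return the n most frequent adjectives as (word, count) pairs."""
    counts = Counter(
        word.lower() for word, tag in tagged_tokens if tag in ADJECTIVE_TAGS
    )
    return counts.most_common(n)


def plot_top_adjectives(pairs, title, out_path):
    """Save a bar graph of (word, count) pairs to out_path."""
    import matplotlib
    matplotlib.use("Agg")  # render to a file without opening a window
    import matplotlib.pyplot as plt

    words = [word for word, _ in pairs]
    freqs = [count for _, count in pairs]
    fig, ax = plt.subplots(figsize=(10, 5))
    ax.bar(words, freqs)
    ax.set_title(title)
    ax.set_ylabel("Frequency")
    ax.tick_params(axis="x", rotation=90)
    fig.tight_layout()
    fig.savefig(out_path)
    plt.close(fig)


if __name__ == "__main__":
    tagged = [("Good", "JJ"), ("boy", "NN"), ("good", "JJ"), ("better", "JJR")]
    print(top_adjectives(tagged, n=2))
```

Calling `top_adjectives` once per character group and passing each result to `plot_top_adjectives` would give one graph per group.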
- Week of 5/2-5/8
- Created bar graphs to represent the top twenty adjectives used by each group of my social class corpus (Malfoys, Weasleys, Harry, Hermione, Filch & Figg)
- Updated the Processes page to include an in-depth description of the project process.