
# 1922 Film Industry Trade Press Corpus

## Description of the data and file structure

The data is in a .ZIP archive and consists of 23 DJVU text files, one for each publication in the corpus. The titles, dates, and links to the full scans of the publications on the Internet Archive are listed below.

| **Publication** | **Location** | **Dates** | **URL** |
| :--- | :--- | :--- | :--- |
| *The American Cinematographer* | Los Angeles, US | July 1922 | |
| *Camera* | Los Angeles, US | April 1922–April 1923 | |
| *Canadian Moving Picture Digest* | Toronto, CA | May–October 1922 | |
| *Cine-mundial* | New York, US | 1922 | |
| *Cinéa* | Paris, FR | 1922 | |
| *Der Kinematograph* | Düsseldorf, DE | July 1922 | |
| *Exhibitor’s Trade Review* | New York, US | June–August 1922 | |
| *Exhibitors Herald* | Chicago, US | July–September 1922 | |
| *Exhibitors Herald* | Chicago, US | October–December 1922 | |
| *The Film Daily* | New York, US | 1922 | |
| *The Film Renter and Moving Picture News* | London, UK | July–August 1922 | |
| *Motion Picture News* | New York, US | July–August 1922 | |
| *The Motion Picture Studio* | London, UK | June 1922–February 1923 | |
| *Moving Picture World* | New York, US | July–August 1922 | |
| *Paramount Pep* | New York, US | July–December 1922 | |
| *Photoplay* | Chicago, US | July–December 1922 | |
| *Picturegoer* | London, UK | 1922 | |
| *Shadowland* | New York, US | January–May 1922 | |
| *Tess of the Storm Country* (United Artists Pressbook) | Los Angeles, US | 1922 | |
| *The Great Selection: “First National First” Season 1922–1923* | New York, US | 1922 | |
| *Universal Weekly* | New York, US | 1922 | |
| *Variety* | New York, US | July 1922 | |

## Sharing/Access information

You can find this data on both the Internet Archive, as a part of the [Media History Digital Library Collection](), and the [Media History Digital Library](https://mediahist.org/). You can also search these, and many other related publications, via [Lantern](https://lantern.mediahist.org/).

## Library Dependencies

The scripts use several external Python libraries. They are all tracked in `requirements.txt` so that you can install everything with a single command. Best practice is to use a virtual environment via [VENV](https://virtualenv.pypa.io/en/latest/) or [Conda](https://uoa-eresearch.github.io/eresearch-cookbook/recipe/2014/11/20/conda/). We use VENV as the example here ([via this tutorial](https://www.dataquest.io/blog/a-complete-guide-to-python-virtual-environments/)):

* Create your virtual environment
  * `python3 -m venv path/to/virtual/environment`
* Activate your virtual environment
  * `source path/to/virtual/environment/bin/activate`
* Install required dependencies
  * `pip install -r requirements.txt` or possibly `python3 -m pip install -r requirements.txt`

---

## Creating Search List

Using [Lantern](https://lantern.mediahist.org/) or the [Media History collection on the Internet Archive](https://archive.org/details/mediahistory), create a `.txt` file with the IA identifiers for the publications you wish to include in your corpus, with each identifier on a new line. On Lantern you can find the identifier at the end of the URL, before the underscore and leaf number. On the Internet Archive you can find the identifier at the end of the URL or in the description section under the viewer, labeled **Identifier**.

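As a rough illustration of the Lantern rule described above, the sketch below takes the last path segment of a URL and strips a trailing underscore followed by digits. The function name `identifier_from_lantern_url` and the sample URL are hypothetical, and the pattern assumes the leaf number is the final `_NNNN` group, so it is worth checking any result against the **Identifier** label on the Internet Archive.

```python
import re
from urllib.parse import urlparse


def identifier_from_lantern_url(url: str) -> str:
    """Return the last URL path segment with a trailing `_<digits>` leaf number removed."""
    last_segment = urlparse(url).path.rstrip("/").split("/")[-1]
    # Assumption: the leaf number is the final underscore followed only by digits.
    return re.sub(r"_\d+$", "", last_segment)


# Hypothetical Lantern-style URL, used for illustration only.
print(identifier_from_lantern_url(
    "https://lantern.mediahist.org/catalog/variety67-1922-07_0012"
))
```

Identifiers that themselves end in an underscore and digits would be truncated by this pattern, so treat it as a starting point rather than a general parser.
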
### Sample Input List

This is a list of a handful of selected publications:

```
film-renter-and-moving-picture-news-1922-07
canadian-moving-picture-digest-1922-05
kinematograph-1922-07
paramountpepjuld07unse
movingpicturewor57july
motionpicturenew26july
americancinemato00amer
exhibitorsherald15exhi
exhibitorstra00newy
variety67-1922-07
```

---

## Downloading MHDL Files

The `download.py` file will download items from the 'mediahistory' collection on the Internet Archive that match the identifiers in your search list. When you run the script, you specify the directory in which to store the files and the name of the CSV file in which to store some basic item metadata. The metadata is useful when automating similarity detection. You can skip this step if you are using the files in this deposit: point the `similarityMulti.py` script at the downloaded corpus instead.

### Usage:

`python3 download.py -o OUTPUT_DIR -m METADATA_FILE -i INPUT_FILE`

* `OUTPUT_DIR` - specify a directory that you would like to save the text files to
* `METADATA_FILE` - specify the name of a CSV file to store item metadata to
* `INPUT_FILE` - a text file with a list of IA identifiers

`python3 download.py -h` - to display arguments

#### Sample Output

This is the metadata output from downloading one publication:

```
identifier,title,year,creator,identifier-access
pressbook-wb-hot-heiress,"Hot Heiress (Warner Bros. Pressbook, 1931)",1931,Warner Bros.,http://archive.org/details/pressbook-wb-hot-heiress
```

This is the stdout for a simple search of one pressbook:

```
% python3 download.py -o staging -m meta.csv -i search-list.txt
*** MHDL Downloader ***
Using: search-list.txt as input list.
The output directory is: staging and metadata will be saved to: meta.csv
pressbook-wb-hot-heiress: downloading pressbook-wb-hot-heiress_djvu.txt: 67.9kiB [00:00, 312kiB/s]
*** Finished ***
```

---

## Internal Similarity

Once you've downloaded your film industry press files, you can use the `similarityMulti.py` script to run some basic tests of how similar the publications are to one another.

NOTE: The scripts do not remove any stopwords or implement other quality controls, so the "Scanned for the MHDL" at the end of every scan *will* be counted as similar text.

This script will calculate Euclidean Distance and Cosine Distance for each of the files and print basic tables with this information. It will also calculate Levenshtein Distance three different ways using [rapidfuzz](https://pypi.org/project/rapidfuzz/), a new implementation of [TheFuzz library from SeatGeek](https://github.com/seatgeek/thefuzz).

Via this [explainer from Analytics Vidhya](https://www.analyticsvidhya.com/blog/2021/07/fuzzy-string-matching-a-hands-on-guide/#:~:text=Token%20Sort%20Ratio%20using%20FuzzyWuzzy,is%20calculated%20between%20the%20strings.):

* `Ratio` calculates the standard Levenshtein distance similarity ratio between two strings.
* In `token sort ratio`, the strings are tokenized and pre-processed by converting to lower case and removing punctuation. The tokens are then sorted alphabetically and joined together. After this, the Levenshtein distance similarity ratio is calculated between the resulting strings.
* `Token set ratio` performs a set operation that takes out the common tokens, instead of just tokenizing the strings, sorting, and then pasting the tokens back together. Extra or repeated words do not matter.

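To make the three variants concrete, here is a minimal standard-library sketch of the preprocessing each one applies. It is not the code in `similarityMulti.py` and it does not call rapidfuzz: the edit distance, the normalization by the longer string's length, and every function name below are assumptions chosen for illustration, so the scores will not necessarily match rapidfuzz's output.

```python
import re


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def ratio(a: str, b: str) -> float:
    """Similarity from 0 to 100, normalized by the longer string's length."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0
    return 100.0 * (1 - levenshtein(a, b) / longest)


def tokens(text: str) -> list[str]:
    """Lowercase, replace punctuation with spaces, and split on whitespace."""
    return re.sub(r"[^\w\s]", " ", text.lower()).split()


def token_sort_ratio(a: str, b: str) -> float:
    """Compare the strings after sorting their tokens alphabetically."""
    return ratio(" ".join(sorted(tokens(a))), " ".join(sorted(tokens(b))))


def token_set_ratio(a: str, b: str) -> float:
    """Compare the shared tokens against each string's shared-plus-unique tokens."""
    set_a, set_b = set(tokens(a)), set(tokens(b))
    common = " ".join(sorted(set_a & set_b))
    with_a = (common + " " + " ".join(sorted(set_a - set_b))).strip()
    with_b = (common + " " + " ".join(sorted(set_b - set_a))).strip()
    return max(ratio(common, with_a), ratio(common, with_b), ratio(with_a, with_b))


print(token_sort_ratio("Moving Picture World", "world, picture moving"))
print(token_set_ratio("The Film Daily Daily", "daily film the"))
```

Both calls score the pairs as identical, because word order and punctuation are discarded by the sort variant and repeated words are discarded by the set variant. The plain `ratio` is sensitive to all three.
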
### Usage

`python3 similarityMulti.py -d INPUT_DIR -m METADATA_FILE -i INPUT_FILE -o OUTPUT_DIR`

* `INPUT_DIR` - specify a directory that contains the TXT files that you'd like to process
* `METADATA_FILE` - specify the name of a CSV file to read item metadata from
* `INPUT_FILE` - a text file with a list of IA identifiers to calculate similarity scores for
* `OUTPUT_DIR` - specify a directory that you would like to save the output files to

---

## Example Command Line Commands

If you wish to repeat the similarity testing from the 1920s Film Trade Press book chapter, the command line commands are:

* `python3 download.py -o 1922 -m 1922.csv -i 1922-search-list.txt`
* `python3 similarityMulti.py -d 1922 -m 1922.csv -i 1922-search-list.txt -o 1922-Similarity`

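If you want to inspect the metadata CSV that `download.py` writes and `similarityMulti.py` reads, a short sketch like the following may help. It assumes the column names shown in the sample metadata output earlier in this README. The helper `load_titles` is hypothetical and is not part of either script.

```python
import csv
import io

# Sample copied from the metadata output shown in this README.
SAMPLE = '''identifier,title,year,creator,identifier-access
pressbook-wb-hot-heiress,"Hot Heiress (Warner Bros. Pressbook, 1931)",1931,Warner Bros.,http://archive.org/details/pressbook-wb-hot-heiress
'''


def load_titles(csv_file) -> dict[str, str]:
    """Map each IA identifier to its title from an open metadata CSV file."""
    return {row["identifier"]: row["title"] for row in csv.DictReader(csv_file)}


titles = load_titles(io.StringIO(SAMPLE))
print(titles["pressbook-wb-hot-heiress"])
```

For a file on disk, such as the `1922.csv` produced by the first command above, the same helper could be called inside `with open("1922.csv", newline="", encoding="utf-8") as f:`. The UTF-8 encoding is an assumption.
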
For the first half of the twentieth century, no American industry boasted a more motley and prolific trade press than the movie business, a cutthroat landscape that set the stage for battle by ink. In 1930, Martin Quigley, publisher of Exhibitors Herald, conspired with Hollywood studios to eliminate all competing trade papers, yet this attempt and each one thereafter collapsed. Exploring the communities of exhibitors and creative workers that constituted key subscribers, Ink-Stained Hollywood tells the story of how a heterogeneous trade press triumphed by appealing to the foundational aspects of industry culture: taste, vanity, partisanship, and exclusivity. In captivating detail, Eric Hoyt chronicles the histories of well-known trade papers (Variety, Motion Picture Herald) alongside important yet forgotten publications (Film Spectator, Film Mercury, and Camera!), and challenges the canon of film periodicals, offering new interpretative frameworks for understanding print journalism’s relationship with the motion picture industry and its continued impact on creative industries today.

We selected the year 1922, with an emphasis on July 1922, for two chief reasons. First, the MHDL had already digitized a wide cross-section of trade papers from that year, including, appropriately for this book, several published outside of the United States. Second, we knew from Eric’s earlier research that there was a great deal of competition within the American film industry’s trade press during this period. In 1922, Variety and the Chicago-based Exhibitors Herald were pursuing strategies to grow their readership and influence within the industry, emphasizing independence, integrity, and uniqueness as distinguishing factors. During the following year, Exhibitors Herald created the “‘Herald Only’ Club,” emphasizing the loyalty of subscribers who wrote only to Exhibitors Herald and read only that paper, to the exclusion of its rivals (Hartman, Rea).

Given the competitive bent of the 1920s trade press, how distinct was each publication? Would the “‘Herald Only’ Club” have any factual grounding once the word patterns, sentences, and page structures were analyzed at scale? In addition to the above-mentioned trade papers, we included 16 additional unique journals. Our corpus included fan magazines (Photoplay, Shadowland, and The Picturegoer), a technical journal (American Cinematographer), English-language trade papers published outside the U.S. (Canadian Moving Picture Digest and The Film Renter and Moving Picture News), and studio-generated publicity (Universal Weekly and Paramount Pep). This dataset comprises the scans of the trade papers we analyzed, as well as a zip archive of the code we used to do the similarity analysis.

C.M. Hartman, qtd. in “‘Herald Only’ Club Gains Six; Veteran and Newcomer Give Reasons for Joining,” Exhibitors Herald, March 29, 1924, 63, http://lantern.mediahist.org/catalog/exhibitorsherald18exhi_0_0073.

George Rea, letter to Exhibitors Herald, Exhibitors Herald, May 26, 1923, 69, http://lantern.mediahist.org/catalog/exhibitorsherald16exhi_0_0869.

The files in the corpus were scanned and OCRed by the Internet Archive using Tesseract and hOCR. They are the versions that we used in our analysis. The similarity testing itself was conducted via two Python scripts: one that downloaded the scans from the Internet Archive (this one can be skipped if using the corpus in this deposit), and another that compared the text files using Euclidean Distance, Cosine Distance, and Levenshtein Distance metrics. The Levenshtein Distance metrics were calculated via RapidFuzz and were the standard Levenshtein Distance Ratio (LDR), the Sorted LDR, which orders the words into alphabetical order, and the Set LDR, which orders the words into alphabetical order and then removes any duplicates.
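
For readers unfamiliar with the two vector-based metrics named above, the following sketch computes Euclidean Distance and Cosine Distance between two short texts. It assumes raw word-count vectors with no stopword removal. The actual vectorization used by the comparison script is not described here, so the helper names and the tokenization are illustrative only.

```python
import math
import re
from collections import Counter


def word_counts(text: str) -> Counter:
    """Count lowercase word tokens; no stopwords are removed."""
    return Counter(re.findall(r"\w+", text.lower()))


def euclidean_distance(a: Counter, b: Counter) -> float:
    """Straight-line distance between two word-count vectors."""
    vocabulary = set(a) | set(b)
    return math.sqrt(sum((a[w] - b[w]) ** 2 for w in vocabulary))


def cosine_distance(a: Counter, b: Counter) -> float:
    """One minus the cosine of the angle between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # convention chosen here for empty documents
    return 1.0 - dot / (norm_a * norm_b)


doc_a = word_counts("Exhibitors report record crowds")
doc_b = word_counts("Exhibitors report small crowds")
print(euclidean_distance(doc_a, doc_b), cosine_distance(doc_a, doc_b))
```

Euclidean Distance grows with document length, while Cosine Distance depends only on the relative proportions of words, which is one reason to report both when the publications differ greatly in size.
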
FOS: Media and communications, film industry, film trade publications, media history digital library, text analysis, film
