ataashihunter | OOC: BargeScraper

Note: BargeScraper is currently in testing. There may be issues.

What is this:
A little script that automatically outputs Dreamwidth-friendly HTML for thread tracking.

Installation:
BargeScraper requires a few things to get up and running. These are the installation steps for Windows.

Install Python 3:

Go to Python's download page and download the newest version of Python 3. Then run the .exe file.

Make sure "Add Python to PATH" is checked, then click "Customize installation".

Make sure pip is checked to be installed.

Make sure Python is being added to the system variables.

Configuring python/updating pip
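Once the installation is finished, open a command prompt (usually done by going to Run and entering cmd). Type "python" and press enter. You should see the Python version followed by a >>> prompt. Type "exit()" and press enter to get back to the command prompt.

Type "pip" and press enter. You should see pip's list of commands and options.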

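If either of these instead gives you "'python' is not recognized as an internal or external command...", it means that python (or pip) was not added to the system path. To fix this, type setx PATH "%PATH%;PythonPath" into the commandline (including the quotation marks), where PythonPath is the path to the folder that holds the python.exe file. If only pip is missing, add the Scripts folder inside that folder the same way, since that is where pip.exe lives. Then open a new command prompt so the change takes effect.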

We now have to update pip. Close the commandline and open it again, but this time run it as administrator. On Windows 8 you can do this by pressing Windows key+X. More info here. Type "python -m pip install -U pip" and press enter.

Install a text editor/setting up folders
When it all works, we are ready to install a text editor.
I like Notepad++. Install it by running the installer and following the instructions.

Now we are getting close. Good job.

Make a new folder where you want BargeScraper to live. Mine lives in Documents. In the commandline, navigate to this folder. The easiest way to do this is to open the new folder and copy its path. Then in the commandline type cd and a space, paste the path, and press enter.

Inside this folder on the command line type:
"pip install requests" then press enter
"pip install beautifulsoup4" then press enter

You are now ready.
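
If you want to double-check that both packages really installed, here is a small sketch you could save as a .py file and run with python. The function name find_missing is just something I made up for this example; it is not part of BargeScraper. Note that the beautifulsoup4 package is imported under the name bs4.

```python
import importlib.util

# Import names of the two packages installed above.
# The "beautifulsoup4" package is imported under the name "bs4".
REQUIRED_MODULES = ["requests", "bs4"]


def find_missing(modules):
    """Return the module names that Python cannot locate."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = find_missing(REQUIRED_MODULES)
    if missing:
        print("Still missing:", ", ".join(missing))
    else:
        print("requests and beautifulsoup4 are both installed.")
```

If it reports something as missing, run the matching pip install line again.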

Configuring and running BargeScraper:
Open Notepad++ (or another text editor) and create a new file. Copy the code from here into this file. Save the file as bargeScrape.py and select Python file as the save-as type.
Enter your username and password, which communities you want to scrape (TLV is what is in there) and which months you want to scrape. Then save again.

Now, in the commandline, check that you are in the folder where the file you just made is, and then type "python bargeScrape.py". The scraping will take some time. Make yourself some tea.

The HTML is written to a file in the same folder. You can open this in Notepad++ and copy from there.

Configuration options
Communities + Coms titles
- These two lists contain the information about the communities you want to scrape. Add the URLs to the coms list. Make sure to put each one in quotation marks (") and separate them with commas. I have added the communities for TLV.
- Add the title you want to use for each community in the corresponding place in the comsTitle list. If I want to use the title "Daydreams" for the first community I scrape, I put "Daydreams" in the first place in the comsTitle list. Remember to add quotation marks and separate the titles with commas.
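
As a rough illustration of how the two lists line up, here is a sketch. The URLs are made-up placeholders, not real communities, and pair_communities is a hypothetical helper that is not in the script. It only shows that the first title belongs to the first URL, the second to the second, and so on, which is why both lists need the same number of entries.

```python
# Hypothetical community URLs; replace them with the communities you play in.
coms = [
    "https://example-logs.dreamwidth.org/",
    "https://example-network.dreamwidth.org/",
]

# One title per community, in the same order as coms.
comsTitle = ["Logs", "Network"]


def pair_communities(urls, titles):
    """Match each community URL with its title, refusing mismatched lists."""
    if len(urls) != len(titles):
        raise ValueError(
            "coms has %d entries but comsTitle has %d" % (len(urls), len(titles))
        )
    return list(zip(titles, urls))


for title, url in pair_communities(coms, comsTitle):
    print(title, "->", url)
```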

Months
- There are three ways to say which months you want to scrape. The first is to give a start month and an optional end month. If you do not give an end month, it will scrape up to and including the current month. The format for these months is "YYYY/MM". Remember to use quotation marks.
- The second option is to add only the specific months you want to scrape. This is good if you want to check, say, January, March and August, but don't care about the months in between. Add the months you want to scrape to the months list and leave the start and end dates as empty strings (""). The format for these months is "YYYY/MM". Remember to use quotation marks and separate the months with commas.
- The last option is to not give any months. This causes the default behavior, which is to scrape only the current month. Leave the months list empty and the start and end dates as "".
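
The three options above could be sketched like this. This is only my illustration of the rules as described, not the code the script actually uses, and the names month_range and resolve_months and their parameters are made up for the example.

```python
from datetime import date


def month_range(start, end):
    """List every "YYYY/MM" month from start to end, both included."""
    year, month = (int(part) for part in start.split("/"))
    end_year, end_month = (int(part) for part in end.split("/"))
    result = []
    while (year, month) <= (end_year, end_month):
        result.append("%04d/%02d" % (year, month))
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return result


def resolve_months(start="", end="", months=None, today=None):
    """Pick the months to scrape following the three options above."""
    today = today or date.today()
    current = "%04d/%02d" % (today.year, today.month)
    if start:
        # Option 1: a start month, with the end month defaulting to today.
        return month_range(start, end or current)
    if months:
        # Option 2: an explicit list of months.
        return list(months)
    # Option 3: nothing given, so only the current month.
    return [current]
```

For example, resolve_months(start="2015/11", end="2016/02") gives the four months from November 2015 through February 2016.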

Other options
-Filename: change this if you want a different name for the file with the output HTML.
-Condensed: if you set this to True, it will put a cut for each month.
-displayName: if the name you log in as is not the same as your display name, add your display name here. This can happen if your login name/URL has a - but the username is displayed with an _.
-tagsToCheck: if you also want a list of the posts where you still need to add your tag, add the tag you'll be looking for to the list here. You can check for several tags. Remember to use lowercase letters, surround each tag with quotation marks, and separate several tags with commas. You can use spaces here, for example "the iron bull".
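
To show why the hyphen/underscore and lowercase details matter, here is a small sketch. It is an assumption about how such checks could work, not a copy of what the script does, and to_display_name and missing_tags are hypothetical names.

```python
def to_display_name(login_name):
    """Guess the display name by swapping hyphens for underscores."""
    return login_name.replace("-", "_")


def missing_tags(post_tags, tags_to_check):
    """Return the tags from tags_to_check that a post does not carry yet."""
    present = {tag.strip().lower() for tag in post_tags}
    return [tag for tag in tags_to_check if tag.lower() not in present]


# A post tagged "The Iron Bull" already counts as tagged for "the iron bull".
print(missing_tags(["The Iron Bull"], ["the iron bull"]))
```

The comparison lowercases the tags found on the post, so a tag written in lowercase in tagsToCheck can match however the tag is capitalised on the page.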

Thanks
Thank you to my partner, Claire, for her help and to my beta testers for your feedback. You have helped make this better.

Please share and enjoy BargeScraper, and if you need any help you can post a comment below or contact me at @craftyviking on Plurk.


From:

cantcatchme

Threadjacking on a collapsed thread bug:

Traceback (most recent call last):
File "Bargescraper.py", line 155, in <module>
processOneComm(coms[index]+month)
File "Bargescraper.py", line 86, in processOneComm
findComments(toplevelcomments, title)
File "Bargescraper.py", line 48, in findComments
findThreadjack(commentUrl, title)
File "Bargescraper.py", line 59, in findThreadjack
commentUrl = comment.find(class_="commentpermalink").a['href']
AttributeError: 'NoneType' object has no attribute 'a'

From:

ataashihunter

Bug noted, investigation in progress. Will update when a solution is found.

From:

cantcatchme

Now I'm getting:

Scraping for 2015/07
Scraping Logs
Scraping Network
Traceback (most recent call last):
File "Bargescraper.py", line 159, in <module>
processOneComm(coms[index]+month)
File "Bargescraper.py", line 79, in processOneComm
user = fullPostSoup.find(class_ = "ljuser")['lj:user']
TypeError: 'NoneType' object is not subscriptable

This was during a multi-month scan, and is on a month I scanned successfully in this manner before the update. When I scan the month alone, the scraper works as intended. Currently re-trying a multi month scan to see if the problem replicates.

From:

cantcatchme

Also got this one:

Scraping for 2015/10
Scraping Logs
Scraping Network
Traceback (most recent call last):
File "Bargescraper.py", line 159, in <module>
processOneComm(coms[index]+month)
File "Bargescraper.py", line 90, in processOneComm
findComments(toplevelcomments, title)
File "Bargescraper.py", line 48, in findComments
findThreadjack(commentUrl, title)
File "Bargescraper.py", line 56, in findThreadjack
poster = comment.find(class_="comment-poster").span['lj:user']
AttributeError: 'NoneType' object has no attribute 'span'

From:

ataashihunter

I have updated the code with a possible fix for this. Code is here
Please check if it works now.

Edited Date: 2016-03-02 10:54 am (UTC)

From:

tinkermoose

So I've clearly failed somewhere in the whole saving the file to the right place thing, b/c I get to the last step and get told "python: can't open file 'bargeScrape.py': [Errno 2] no such file or directory".

(this is Jay, btw)

From:

ataashihunter

Hi. This is a fairly common error (at least for me). You get it because the file you tell python to run isn't in the folder you are in.
To solve this, first check that the file you want to run is in the same folder you are in. You can do this either in the UI or on the commandline. On the commandline, type dir and press enter. You will then get a list of the contents of the folder you are in.

Here I have done it on my computer and you can see I have two files, the python file and the output.

The other option is that there is a typo between what you named the file and what you try to run. The second part of the above picture shows me trying to run "python bargescraper.py" when my file is named without the extra r. The solution here is to type the filename exactly as you named the file.
There is a neat trick to avoid typos like this too. Type python and a space, then the first letter (or first few) of what you named your file, and then press tab. The commandline should then autofill the rest of the file name. If you have several files that start with the same letter it will pick the first alphabetically, but you can press tab several times until you have the one you want.
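
If you would rather let Python hunt for the typo, here is a rough sketch using the standard difflib module. The function name suggest_script is made up for this example. It lists the .py files in a folder whose names look close to what you typed.

```python
import difflib
import os


def suggest_script(typed_name, folder="."):
    """Return .py files in folder whose names closely resemble typed_name."""
    candidates = [name for name in os.listdir(folder) if name.lower().endswith(".py")]
    # Compare in lowercase so a difference in capitals alone still matches.
    by_lower = {name.lower(): name for name in candidates}
    matches = difflib.get_close_matches(
        typed_name.lower(), list(by_lower), n=3, cutoff=0.6
    )
    return [by_lower[match] for match in matches]


if __name__ == "__main__":
    print(suggest_script("bargescraper.py"))
```

Run it from the folder you are in. If it prints a name, that is most likely the file you meant to run.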

Edited Date: 2016-03-02 07:40 am (UTC)

From:

a_very_distinctive

Ok, think I fixed that. Here's my latest mess:

C:\Users\Authorized User\Documents\Bargescraper>python bargeScrape.py
Traceback (most recent call last):
File "bargeScrape.py", line 1, in <module>
import requests
ImportError: No module named 'requests'

(pretty sure I really do define 'how to get the incompetent through this')

From:

ataashihunter

This means that you didn't install requests. Do this:
"pip install requests" then press enter
"pip install beautifulsoup4" then press enter

From:

a_very_distinctive

aaaand I still fail. This time, all in red:

C:\Users\Authorized User\Documents\Bargescraper>pip install requests
Collecting requests
Using cached requests-2.9.1-py2.py3-none-any.whl
Installing collected packages: requests
Exception:
Traceback (most recent call last):
File "c:\program files (x86)\python35-32\lib\site-packages\pip\basecommand.py", line 211, in main
status = self.run(options, args)
File "c:\program files (x86)\python35-32\lib\site-packages\pip\commands\install.py", line 311, in run
root=options.root_path,
File "c:\program files (x86)\python35-32\lib\site-packages\pip\req\req_set.py", line 646, in install
**kwargs
File "c:\program files (x86)\python35-32\lib\site-packages\pip\req\req_install.py", line 803, in install
self.move_wheel_files(self.source_dir, root=root)
File "c:\program files (x86)\python35-32\lib\site-packages\pip\req\req_install.py", line 998, in move_wheel_files
isolated=self.isolated,
File "c:\program files (x86)\python35-32\lib\site-packages\pip\wheel.py", line 339, in move_wheel_files
clobber(source, lib_dir, True)
File "c:\program files (x86)\python35-32\lib\site-packages\pip\wheel.py", line 310, in clobber
ensure_dir(destdir)
File "c:\program files (x86)\python35-32\lib\site-packages\pip\utils\__init__.py", line 71, in ensure_dir
os.makedirs(path)
File "c:\program files (x86)\python35-32\lib\os.py", line 241, in makedirs
mkdir(name, mode)
PermissionError: [WinError 5] Access is denied: 'c:\\program files (x86)\\python35-32\\Lib\\site-packages\\requests'
You are using pip version 7.1.2, however version 8.0.3 is available.
You should consider upgrading via the 'python -m pip install --upgrade pip' command.
