[personal profile] ataashihunter
Note: BargeScraper is currently in testing. There may be issues.

What is this:
A little script that automatically outputs Dreamwidth-friendly HTML for thread tracking.

Installation:
BargeScraper requires a few things to get up and running. These are the installation steps for Windows.

Install Python 3:

Go to Python's download page and download the newest version of Python 3. Then run the .exe file.


Make sure "Add Python to PATH" is checked and then click "Customize installation".


Make sure pip is checked to be installed.


Make sure Python is being added to the system variables.

Configuring python/updating pip
Once the installation is finished, open a command prompt (usually done by going to Run and entering cmd). Type "python". You should see this:


Type "pip". You should see this:



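If either of these instead shows "'python' is not recognized as an internal or external command...", it means that Python (or pip) was not added to the system path. To fix this, type setx PATH "%PATH%;PythonPath" into the command line (including the quotation marks), where PythonPath is the path to the folder containing the python.exe file. Then close the command prompt and open a new one, since the change only applies to newly opened windows. Note that typing only setx PATH PythonPath would replace your existing path instead of adding to it.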

We now have to update pip. Close the command line and open it again, but this time run it as administrator. To open it as administrator on Windows 8, you can press Windows key+X and choose "Command Prompt (Admin)" (more info here). Type "python -m pip install -U pip" and run it.

Install a text editor/setting up folders
When it all works, we are ready to install a text editor.
I like Notepad++. Install it by running the installer and following the instructions.

Now we are getting close. Good job.

Make a new folder where you want BargeScraper to live. Mine lives in Documents. In the command line, navigate to this folder. The easiest way to do this is to go into the new folder and copy its path. Then, in the command line, type cd followed by a space, paste the path, and press Enter.

Inside this folder on the command line type:
"pip install requests" then press enter
"pip install beautifulsoup4" then press enter

You are now ready.
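
If you want to double-check that both packages ended up where Python can find them, a small check along these lines may help. It is only a sketch: the helper name missing_packages is made up for this example and is not part of BargeScraper. Note that beautifulsoup4 is imported under the name bs4.

```python
import importlib.util


def missing_packages(names):
    """Return the module names from `names` that Python cannot find."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # "requests" and "bs4" are the import names of the two packages installed above.
    missing = missing_packages(["requests", "bs4"])
    if missing:
        print("Still missing:", ", ".join(missing))
    else:
        print("Both packages are installed.")
```

Save it as a .py file in the same folder and run it with "python" followed by the file name, the same way you will run the scraper later.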

Configuring and running BargeScraper:
Open Notepad++ (or another text editor) and create a new file. Copy the code from here into this file. Save the file as bargeScrape.py and select Python file under "Save as type".
Enter your username and password, which communities you want to scrape (the TLV communities are already in there), and which months you want to scrape. Then save again.

Now in the commandline, check that you are in the folder where the file you just made is and then type "python bargeScrape.py". The scraping will take some time. Make yourself some tea.

The HTML will be written to a file in the same folder; you can open it in Notepad++ and copy from there.

Configuration options
Communities + Coms titles
- These two lists contain the information about the communities you want to scrape. Add URLs to the coms list; make sure to put each one in quotation marks (") and separate them with commas. I have added the communities for TLV.
- Add the title you want to use for each community in the corresponding place in the comsTitle list. If I want to use the title "Daydreams" for the first community I scrape, I put "Daydreams" in the first place in the comsTitle list. Remember to add quotation marks and separate the titles with commas.
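
As a rough sketch, the two lists might look like the example below. The list names coms and comsTitle follow the description above, but the URLs and the second title are placeholders made up for illustration, and the pair_communities helper is hypothetical (it is not taken from the script). It only shows why the order of the two lists has to match.

```python
# Placeholder values for illustration only; use your own community URLs and titles.
coms = [
    "http://examplelogs.dreamwidth.org/",
    "http://examplenetwork.dreamwidth.org/",
]
comsTitle = [
    "Daydreams",  # title for the first community in coms
    "Network",    # title for the second community in coms
]


def pair_communities(urls, titles):
    """Match each community URL with the title in the same position."""
    if len(urls) != len(titles):
        raise ValueError("coms and comsTitle must have the same number of entries")
    return list(zip(urls, titles))


for url, title in pair_communities(coms, comsTitle):
    print(title, "->", url)
```

If the lists have different lengths, a title would end up attached to the wrong community, which is why the sketch refuses to continue in that case.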

Months
- There are three ways to say which months you want to scrape. The first is to give a start month and an optional end month. If you do not give an end month, it will scrape up to and including the current month. The format for these months is "YYYY/MM"; remember to use quotation marks.
- The second option is to add only the specific months you want to scrape. This is good if you want to check, say, January, March and August, but don't care about the months in between. Add the months you want to scrape to the months list and leave the start and end dates as empty strings (""). The format for these months is "YYYY/MM"; remember to use quotation marks and separate the months with commas.
- The last option is to not give any months. This triggers the default behavior, which is to scrape only the current month. Leave the months list empty and the start and end dates as "".
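
To make the three options concrete, here is a minimal sketch of how they could be turned into a list of months. The function name months_to_scrape and the parameter names startDate, endDate and today are assumptions made for this example; the actual script may name and handle these differently.

```python
from datetime import date


def months_to_scrape(months, startDate="", endDate="", today=None):
    """Return the "YYYY/MM" strings to scrape, following the three options."""
    today = today or date.today()
    current = (today.year, today.month)

    if startDate:
        # Option 1: a start month and an optional end month.
        year, month = (int(part) for part in startDate.split("/"))
        if endDate:
            end = tuple(int(part) for part in endDate.split("/"))
        else:
            end = current  # no end month given: run up to the current month
        result = []
        while (year, month) <= end:
            result.append("%04d/%02d" % (year, month))
            month += 1
            if month > 12:
                year, month = year + 1, 1
        return result

    if months:
        # Option 2: only the specific months that were listed.
        return list(months)

    # Option 3: nothing given, so only the current month.
    return ["%04d/%02d" % current]


print(months_to_scrape([], startDate="2015/11", endDate="2016/02"))
print(months_to_scrape(["2016/01", "2016/03", "2016/08"]))
```

In this sketch a start month takes priority over the months list, which is one reason to leave the start and end dates as "" when you use the list.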

Other options
- Filename: Change this if you want a different name for the file with the output HTML.
- Condensed: If you set this to True, it will put a cut for each month.
- displayName: If the name you log in as is not the same as your display name, add your display name here. This can happen if your login name/URL has a - but the username is displayed with an _.
- tagsToCheck: If you also want a list of the posts where you still need to add your tag, add the tag you'll be looking for to the list here. You can check for several tags. Remember to use lowercase letters, surround each tag with quotation marks, and separate several tags with commas. You can use spaces here, for example "the iron bull".
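
The sketch below illustrates what two of these options mean in practice. Both helpers (needs_tag and month_block) are hypothetical and written only for this example; the uncondensed layout in particular is a guess, not the script's real output. The condensed case assumes Dreamwidth's cut tag with a text attribute.

```python
def needs_tag(postTags, tagsToCheck):
    """Return True if none of the tags in tagsToCheck is on the post."""
    # Post tags are lowercased before comparing, which is why the tags
    # in tagsToCheck should be written in lowercase letters.
    present = {tag.strip().lower() for tag in postTags}
    return not any(tag in present for tag in tagsToCheck)


def month_block(month, linksHtml, condensed):
    """Return one month's links, wrapped in a cut when condensed is True."""
    body = "\n".join(linksHtml)
    if condensed:
        return '<cut text="%s">\n%s\n</cut>' % (month, body)
    # Hypothetical uncondensed layout: a bold month heading above the links.
    return "<b>%s</b>\n%s" % (month, body)


tagsToCheck = ["the iron bull"]
print(needs_tag(["The Iron Bull", "plot"], tagsToCheck))
print(needs_tag(["plot"], tagsToCheck))
print(month_block("2016/02", ['<a href="http://example.dreamwidth.org/1.html">Thread</a>'], True))
```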

Thanks
Thank you to my partner, Claire, for her help, and to my beta testers for your feedback. You have helped make this better.

Please share and enjoy BargeScraper, and if you need any help you can post a comment below or contact me at @craftyviking on Plurk.

Date: 2016-03-01 08:53 pm (UTC)
From: [personal profile] cantcatchme
Threadjacking on a collapsed thread bug:

Traceback (most recent call last):
File "Bargescraper.py", line 155, in <module>
processOneComm(coms[index]+month)
File "Bargescraper.py", line 86, in processOneComm
findComments(toplevelcomments, title)
File "Bargescraper.py", line 48, in findComments
findThreadjack(commentUrl, title)
File "Bargescraper.py", line 59, in findThreadjack
commentUrl = comment.find(class_="commentpermalink").a['href']
AttributeError: 'NoneType' object has no attribute 'a'

Date: 2016-03-02 03:14 pm (UTC)
From: [personal profile] cantcatchme
Now I'm getting:

Scraping for 2015/07
Scraping Logs
Scraping Network
Traceback (most recent call last):
File "Bargescraper.py", line 159, in <module>
processOneComm(coms[index]+month)
File "Bargescraper.py", line 79, in processOneComm
user = fullPostSoup.find(class_ = "ljuser")['lj:user']
TypeError: 'NoneType' object is not subscriptable


This was during a multi-month scan, and is on a month I scanned successfully in this manner before the update. When I scan the month alone, the scraper works as intended. Currently re-trying a multi month scan to see if the problem replicates.

Date: 2016-03-02 05:38 pm (UTC)
From: [personal profile] cantcatchme
Also got this one:

Scraping for 2015/10
Scraping Logs
Scraping Network
Traceback (most recent call last):
File "Bargescraper.py", line 159, in <module>
processOneComm(coms[index]+month)
File "Bargescraper.py", line 90, in processOneComm
findComments(toplevelcomments, title)
File "Bargescraper.py", line 48, in findComments
findThreadjack(commentUrl, title)
File "Bargescraper.py", line 56, in findThreadjack
poster = comment.find(class_="comment-poster").span['lj:user']
AttributeError: 'NoneType' object has no attribute 'span'

Date: 2016-03-02 04:54 am (UTC)
From: [personal profile] tinkermoose
So I've clearly failed somewhere in the whole saving the file to the right place thing, b/c I get to the last step and get told "python: can't open file 'bargeScrape.py': [Errno 2] no such file or directory".

(this is Jay, btw)

Date: 2016-03-02 07:59 am (UTC)
From: [personal profile] a_very_distinctive
Ok, think I fixed that. Here's my latest mess:

C:\Users\Authorized User\Documents\Bargescraper>python bargeScrape.py
Traceback (most recent call last):
File "bargeScrape.py", line 1, in <module>
import requests
ImportError: No module named 'requests'

(pretty sure I really do define 'how to get the incompetent through this')

Date: 2016-03-02 09:45 am (UTC)
From: [personal profile] a_very_distinctive
aaaand I still fail. This time, all in red:

C:\Users\Authorized User\Documents\Bargescraper>pip install requests
Collecting requests
Using cached requests-2.9.1-py2.py3-none-any.whl
Installing collected packages: requests
Exception:
Traceback (most recent call last):
File "c:\program files (x86)\python35-32\lib\site-packages\pip\basecommand.py", line 211, in main
status = self.run(options, args)
File "c:\program files (x86)\python35-32\lib\site-packages\pip\commands\install.py", line 311, in run
root=options.root_path,
File "c:\program files (x86)\python35-32\lib\site-packages\pip\req\req_set.py", line 646, in install
**kwargs
File "c:\program files (x86)\python35-32\lib\site-packages\pip\req\req_install.py", line 803, in install
self.move_wheel_files(self.source_dir, root=root)
File "c:\program files (x86)\python35-32\lib\site-packages\pip\req\req_install.py", line 998, in move_wheel_files
isolated=self.isolated,
File "c:\program files (x86)\python35-32\lib\site-packages\pip\wheel.py", line 339, in move_wheel_files
clobber(source, lib_dir, True)
File "c:\program files (x86)\python35-32\lib\site-packages\pip\wheel.py", line 310, in clobber
ensure_dir(destdir)
File "c:\program files (x86)\python35-32\lib\site-packages\pip\utils\__init__.py", line 71, in ensure_dir
os.makedirs(path)
File "c:\program files (x86)\python35-32\lib\os.py", line 241, in makedirs
mkdir(name, mode)
PermissionError: [WinError 5] Access is denied: 'c:\\program files (x86)\\python35-32\\Lib\\site-packages\\requests'
You are using pip version 7.1.2, however version 8.0.3 is available.
You should consider upgrading via the 'python -m pip install --upgrade pip' command.

Date: 2016-03-03 01:12 pm (UTC)
From: [personal profile] bleak_midwinter
i scraped february twice but I keep getting html that makes the post come out like this?

Date: 2016-03-03 01:33 pm (UTC)
From: [personal profile] bleak_midwinter
I thought the problem was that it didn't have the brackets, so I added them and now there's even less, haha... here
Edited Date: 2016-03-03 01:33 pm (UTC)

Date: 2016-08-01 01:08 pm (UTC)
From: [personal profile] utselet
So this has worked perfectly well for me for months, and suddenly I'm getting the same error whenever it tries to scrape the network community. It always gets partway through (anywhere from 5% to 60%) and then I hit:

Scraping Network
Checking 75 posts and their comments.
9 %Traceback (most recent call last):
File "C:\Program Files (x86)\Python35-32\lib\site-packages\requests\packages\urllib3\connectionpool.py", line 385, in _make_request
httplib_response = conn.getresponse(buffering=True)
TypeError: getresponse() got an unexpected keyword argument 'buffering'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "C:\Program Files (x86)\Python35-32\lib\site-packages\requests\packages\urllib3\connectionpool.py", line 578, in urlopen
chunked=chunked)
File "C:\Program Files (x86)\Python35-32\lib\site-packages\requests\packages\urllib3\connectionpool.py", line 387, in _make_request
httplib_response = conn.getresponse()
File "C:\Program Files (x86)\Python35-32\lib\http\client.py", line 1197, in getresponse
response.begin()
File "C:\Program Files (x86)\Python35-32\lib\http\client.py", line 297, in begin
version, status, reason = self._read_status()
File "C:\Program Files (x86)\Python35-32\lib\http\client.py", line 258, in _read_status
line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
File "C:\Program Files (x86)\Python35-32\lib\socket.py", line 575, in readinto
return self._sock.recv_into(b)
ConnectionResetError: [WinError 10054] An existing connection was forcibly closed by the remote host

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "C:\Program Files (x86)\Python35-32\lib\site-packages\requests\adapters.py", line 403, in send
timeout=timeout
File "C:\Program Files (x86)\Python35-32\lib\site-packages\requests\packages\urllib3\connectionpool.py", line 623, in urlopen
_stacktrace=sys.exc_info()[2])
File "C:\Program Files (x86)\Python35-32\lib\site-packages\requests\packages\urllib3\util\retry.py", line 281, in increment
raise MaxRetryError(_pool, url, error or ResponseError(cause))
requests.packages.urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='127.0.0.1', port=8082): Max retries exceeded with url: http://lastvoyages.dreamwidth.org/367854.html?thread=42837230 (Caused by ProxyError('Cannot connect to proxy.', ConnectionResetError(10054, 'An existing connection was forcibly closed by the remote host', None, 10054, None)))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "bargeScrape.py", line 205, in <module>
processOneComm(coms[index]+month)
File "bargeScrape.py", line 109, in processOneComm
findComments(toplevelcomments, title,tags)
File "bargeScrape.py", line 54, in findComments
findThreadjack(commentUrl, title, tags)
File "bargeScrape.py", line 58, in findThreadjack
commentThreadRaw = c.get(url)
File "C:\Program Files (x86)\Python35-32\lib\site-packages\requests\sessions.py", line 487, in get
return self.request('GET', url, **kwargs)
File "C:\Program Files (x86)\Python35-32\lib\site-packages\requests\sessions.py", line 475, in request
resp = self.send(prep, **send_kwargs)
File "C:\Program Files (x86)\Python35-32\lib\site-packages\requests\sessions.py", line 585, in send
r = adapter.send(request, **kwargs)
File "C:\Program Files (x86)\Python35-32\lib\site-packages\requests\adapters.py", line 465, in send
raise ProxyError(e, request=request)
requests.exceptions.ProxyError: HTTPConnectionPool(host='127.0.0.1', port=8082): Max retries exceeded with url: http://lastvoyages.dreamwidth.org/367854.html?thread=42837230 (Caused by ProxyError('Cannot connect to proxy.', ConnectionResetError(10054, 'An existing connection was forcibly closed by the remote host', None, 10054, None)))


I've updated literally everything I can think of to update (Python, pip, beautifulsoup, requests) and that hasn't seemed to help.

Date: 2016-08-01 01:56 pm (UTC)
From: [personal profile] new_toys
Update: I didn't think to check what URLs it was timing out on before, but while it's *mostly* that same one, it seems to be alternating between two.

Attempts 1 and 2: http://lastvoyages.dreamwidth.org/367854.html?thread=42837230
Attempts 3 and 4: http://lastvoyages.dreamwidth.org/376386.html?thread=44435266
Attempts 5-9: Back to http://lastvoyages.dreamwidth.org/367854.html?thread=42837230
Attempt 10: Back to http://lastvoyages.dreamwidth.org/376386.html?thread=44435266

Date: 2016-08-03 07:57 am (UTC)
From: [personal profile] yourhighnessness
Hi! I'm sure I've done something stupid or wrong in setup - but I'm getting this:

C:\Users\Nina>Python
Python 3.5.2 (v3.5.2:4def2a2901a5, Jun 25 2016, 22:01:18) [MSC v.1900 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> pip
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
NameError: name 'pip' is not defined

Date: 2017-01-04 01:43 am (UTC)
From: [personal profile] callmefives
I feel like I should be able to fix this/figure it out myself, b/c I know I've changed between accounts successfully before, but I have no damn clue right now... when I try to scrape for Fives' it comes up as finding zero posts to scrape in every comm.

I have it set for December still, and all I needed to do was change the account name and password, then save in notepad then run, right? It just... isn't finding any posts in any of the comms to even try to scrape.

Date: 2017-01-04 09:08 am (UTC)
From: [personal profile] a_very_distinctive
I didn't change anything but the journal name and password, so there really shouldn't be.

Date: 2017-01-04 09:11 am (UTC)
From: [personal profile] hottestrogue
Strange. I will look at it when I get home

Date: 2017-01-04 09:11 am (UTC)
From: [personal profile] a_very_distinctive
Just tried it again and got the same result

Microsoft Windows [Version 6.1.7601]
Copyright (c) 2009 Microsoft Corporation. All rights reserved.

C:\Users\Authorized User>cd C:\Users\Authorized User\Documents\Bargescraper

C:\Users\Authorized User\Documents\Bargescraper>python bargeScrape.py
Logging in
Login complete. Begin Scraping
Scraping for 2016/12
Scraping Logs
Checking 0 posts and their comments.
100 % Done with this community
Scraping Network
Checking 0 posts and their comments.
100 % Done with this community
Scraping Greatest Hits
Checking 0 posts and their comments.
100 % Done with this community
Scraping OOC
Checking 0 posts and their comments.
100 % Done with this community
Scrape complete. Outout saved to scrapeOutput.html

C:\Users\Authorized User\Documents\Bargescraper>

Date: 2017-01-04 09:12 am (UTC)
From: [personal profile] hottestrogue
Could you send me a copy of the code you are running?

Page generated Sep. 19th, 2017 06:53 pm